# Table of Contents
* &nbsp;
	* [imports](#imports)
	* [simple functions](#simple-functions)
* [Load data](#Load-data)
	* [setting paths](#setting-paths)
	* [parsed content and raw questions and descriptions](#parsed-content-and-raw-questions-and-descriptions)
	* [localization and recognition](#localization-and-recognition)
	* [building spellings and grammar](#building-spellings-and-grammar)
* [Clean and prepare data](#Clean-and-prepare-data)
	* [extract media links](#extract-media-links)
	* [remove non-conforming content](#remove-non-conforming-content)
		* [code](#code)
		* [run](#run)
	* [remove recognition and localization errors](#remove-recognition-and-localization-errors)
* [Add image annotations](#Add-image-annotations)
	* [localization](#localization)
	* [recognition](#recognition)
		* [code](#code)
		* [run](#run)
		* [hide](#hide)
* [Integrate diagram questions and descriptions](#Integrate-diagram-questions-and-descriptions)
	* [match diagram topics to lessons](#match-diagram-topics-to-lessons)
		* [code](#code)
		* [run](#run)
		* [hide](#hide)
	* [merge questions](#merge-questions)
		* [code](#code)
		* [run](#run)
		* [hide](#hide)
	* [merge descriptions](#merge-descriptions)
		* [hide](#hide)
	* [Apply spelling and grammar fixes](#Apply-spelling-and-grammar-fixes)
		* [code](#code)
		* [run](#run)
		* [hide](#hide)
* [Topic key collisions](#Topic-key-collisions)
* [Refinements to make](#Refinements-to-make)
* [End](#End)


## imports

In [2]:
%%capture
import matplotlib as mpl
mpl.use("Agg")
import matplotlib.pylab as plt
#%matplotlib notebook
%matplotlib inline
%load_ext base16_mplrc
# %base16_mplrc light solarized
%base16_mplrc dark solarized
plt.rcParams['grid.linewidth'] = 0
plt.rcParams['figure.figsize'] = (16.0, 10.0)

import numpy as np
import pandas as pd
import scipy.stats as st
from scipy.stats.mstats import mode

import itertools
import math
from collections import Counter, defaultdict, OrderedDict
%load_ext autoreload
%autoreload 2

import cv2
import pprint
import pickle
import json
import requests
import io
import sys
import os
from binascii import b2a_hex
import base64
from wand.image import Image as WImage
from IPython.display import display
from IPython.core.display import HTML
import PIL.Image as Image
from copy import deepcopy
import glob
import random

import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import wordpunct_tokenize
import language_check
import enchant
import difflib
import diff_match_patch
import fuzzywuzzy.fuzz as fuzz
import re
import jsonschema
from pdfextraction.ck12_schema import ck12_schema as schema

## simple functions

In [155]:
def write_file(filename, data_dict, output_dir='output_data_from_nbs'):
    with open(os.path.join(output_dir, filename), 'w') as f:
        json.dump(data_dict, f, indent=4, sort_keys=True)
        
def get_img_n(image_name):
    return [re.findall("[0-9]+", image_name)][0][0]

def clean_list(dir_path):
    hidden_removed = filter(lambda f: not f.startswith('.'), os.listdir(dir_path))
    return [topic.replace('_diagram', '') for topic in hidden_removed]

# Load data

## setting paths

In [4]:
output_dir = 'output_data_from_nbs/'
raw_data_dir = '../spare5_produced_data/data/'
raw_dq_file = 'ai2_testquestions_20161005.csv'
s5_raw_decriptions = 'ai2_diagramdescriptions_20161018.csv'
ai2_raw_decriptions = 'our_description.csv'
glossary_path = os.path.join(output_dir, 'flexbook_glossary.pkl')

turk_proc_dir = '/Users/schwenk/wrk/stb/diagram_questions/turk_processing/'
metadata_dir = turk_proc_dir + 'store_hit_results_metadata/'
lc_results_dir = 'loc_group_3'
box_loc_joined = 'loc_annotations'
recog_results_dir = 'group_latest_combined'

box_choices_1_dir = 'final_text_boxes_fixed'
box_choices_2_dir = 'final_text_boxes_pass_2'
none_agree = 'no_turkers_agree_lookup.pkl'
two_agree_lookup = 'two_turkers_agree_lookup.pkl'
all_agree_lookup = 'user_diag_loopkup.pkl'

recog_performed = '/Users/schwenk/wrk/stb/diagram_questions/turk_processing/final_diagrams/'
all_dir = '/Users/schwenk/wrk/stb/ai2-vision-textbook-dataset/diagrams/tqa_diagrams_v0.9/'
pruned_dir = '/Users/schwenk/wrk/stb/ai2-vision-textbook-dataset/diagrams/dataset_Sep_27/tqa_diagrams_v0.9_question_images/'
description_dir = '/Users/schwenk/wrk/stb/spare5_produced_data/tqa_diagrams_v0.9_inbook/'

## parsed content and raw questions and descriptions

In [191]:
%%capture
# load complete text v 3.5, raw diagram questions and descriptions
with open(output_dir + 'ck12_dataset_beta_v3_5.json', 'r') as f:
    ck12_combined_dataset_raw = json.load(f)
with open(output_dir + 'ck12_flexbook_only_beta_v3.json', 'r') as f:
    flexbook_ds = json.load(f)
with open(output_dir + 'ck12_lessons_only_beta_v3.json', 'r') as f:
    lessons_ds = json.load(f)

# loading questions
desc_df = pd.read_csv(raw_data_dir + s5_raw_decriptions, encoding='latin-1')
desc_df['diagram'] = desc_df['reference_id'].apply(lambda x: x.split('/')[-1])

ai2_raw_decriptions_df = pd.read_csv(raw_data_dir + ai2_raw_decriptions, encoding='latin-1')
ai2_written_df_completed = ai2_raw_decriptions_df[['Topic', 'Image Path', 'Description']]
ai2_written_df_completed['diagram'] = ai2_written_df_completed['Image Path'].apply(lambda x: x.split('/')[-1])
ai2_written_df_completed['topic'] = ai2_written_df_completed['Topic']
del  ai2_written_df_completed['Topic']

#loading questions
q_col = '03_write_question'
r_ans_col = '04_write_right_answer'
w_ans_col = '05_write_wrong_answers'
data_cols = [q_col, r_ans_col, w_ans_col]
raw_dq_df = pd.read_csv(raw_data_dir + raw_dq_file, encoding='latin-1')
dr_proc_df = raw_dq_df.copy()
dr_proc_df['wac_list'] = dr_proc_df[w_ans_col].apply(lambda x: json.loads(x))
dr_proc_df['diagram'] = dr_proc_df['reference_id'].apply(lambda x: x.split('/')[-1])
dr_proc_df['topic'] = dr_proc_df['reference_id'].apply(lambda x: x.split('/')[-1].rsplit('_', maxsplit=1)[0])

with open('../diagram_questions/topic_match_terms.json', 'r') as f:
    topic_term_match = json.load(f)    

with open(glossary_path, 'rb') as f:
    flexbook_glossary = pickle.load(f)

### localization and recognition

In [8]:
loc_res_df = pd.read_pickle(os.path.join(metadata_dir, lc_results_dir, 'complete_df.pkl'))
recog_res_df = pd.read_pickle(os.path.join(metadata_dir, recog_results_dir, 'recog_df.pkl'))

## building spellings and grammar

In [112]:
# loading spelling defs
with open(output_dir + 'ck_12_vocab_words.pkl', 'rb') as f:
    ck_12_vocab = set(pickle.load(f))
with open(output_dir + 'ck_12_all_words.pkl', 'rb') as f:
    ck_12_corp = set(pickle.load(f))
    
with open(output_dir + 'spellings_to_rev.txt', 'r') as f:
    whitelisted_words = f.read().split('\n')[:-1]    
with open(output_dir + './desc_spellings_to_rev.txt', 'r') as f:
    whitelisted_words += f.read().split('\n')[:-1]
with open(output_dir + './ck_12_spelling_rev.txt', 'r') as f:
    whitelisted_words += f.read().split('\n')[:-1]
    
ck_12_corp.update(ck_12_vocab)
ck_12_corp.update(whitelisted_words)
# this must be run later- post loading and processing recog
ck_12_corp.update(diagram_rec_corpus)

# build spelling dict updated with words from science corpus
edict = enchant.Dict("en_US")
anglo_edict = enchant.Dict("en_UK")
cached_sw = stopwords.words("english") + list(string.punctuation)
for word in ck_12_corp:
    if word.isalpha() and len(word) > 3:
        edict.add(word)
        
# grammaer checker
gram_checker = language_check.LanguageTool('en-US')
gram_checker.disabled = set(['SENT_START_CONJUNCTIVE_LINKING_ADVERB_COMMA', 'POSSESSIVE_APOSTROPHE', 'A_PLURAL'])
gram_checker.disable_spellchecking()

# Clean and prepare data

## extract media links

In [110]:
ck12_combined_dataset = deepcopy(ck12_combined_dataset_raw)

In [113]:
pat_str = "(?:https?:\/\/(?:www\.).*?\s)"
web_link_patern=re.compile(pat_str)

def clean_content_text(content_str, web_link_patern):
    removed_links = web_link_patern.findall(content_str)
    if not removed_links:
        return '', ''
    split_txt = web_link_patern.split(content_str)
    cleaned_text = ' '.join([txt for txt in split_txt if txt])
    return cleaned_text, [link.strip() for link in removed_links]

def extract_links(complete_ds):
    for subject, lessons in complete_ds.items():
        for lesson_title, lesson in lessons.items():
            for topic, content in lesson['topics'].items():
                content_str = content['content']['text']
                new_text, links = clean_content_text(content_str, web_link_patern)
                content['content']['mediaLinks'] = []
                if links:
                    content['content']['text'] = new_text
                    content['content']['mediaLinks'].extend(links)

In [114]:
extract_links(ck12_combined_dataset)

## remove non-conforming content

### code

In [13]:
def validate_schema(dataset_json):
    errors = []
    try:
        validator = jsonschema.Draft4Validator(schema)
        for error in sorted(validator.iter_errors(dataset_json), key=lambda x: x.absolute_path[0]):
            errors.append([error.message, list(error.absolute_path)[:4]])
    except jsonschema.ValidationError as e:
        errors.append("Error in schema --%s-", e.message)
    return errors

def validate_dataset(dataset_json):
    for subject, flexbook in dataset_json.items():
        schema_errors = validate_schema(flexbook)
        for lesson_name, lesson in flexbook.items():
            ac_errors = check_ac_counts(lesson, subject, lesson_name)
        all_errors = schema_errors + ac_errors
        if not all_errors:
            return 'all validation test passed'
        else:
            return all_errors

def check_ac_counts(lesson_content, subject, lesson_name):
    errors = []
    for qid, question in lesson_content['questions']['nonDiagramQuestions'].items():
        if question['type'] == 'Multiple Choice':
            if len(question['answerChoices']) != 4:
                errors.append([subject, lesson_name, qid + ' mc error'])
        if question['type'] == 'True or False':
            if len(question['answerChoices']) != 2:
                errors.append([subject, lesson_name, qid + ' tf error'])
    return errors

def record_validation_errors(dataset):
    qs_removed = []
    for subject, flexbook in dataset.items():
        validator = jsonschema.Draft4Validator(schema)
        for error in sorted(validator.iter_errors(flexbook), key=lambda x: x.absolute_path[0]):
            lesson, quest, question_class, q_number = list(error.absolute_path)[:4]
            problem_q_section = dataset[subject][lesson][quest][question_class]
            if q_number in problem_q_section.keys():
#                 print(dataset[subject][lesson][quest][question_class].pop(q_number))
                qs_removed.append(dataset[subject][lesson][quest][question_class].pop(q_number))
    return qs_removed

### run

In [115]:
validate_dataset(ck12_combined_dataset)

'all validation test passed'

In [116]:
qs_rem = record_validation_errors(ck12_combined_dataset)
len(qs_rem)

0

## remove recognition and localization errors

In [16]:
diagram_image_names = clean_list(recog_performed)

rec_files = glob.glob(all_dir + '*/*')
more_paths = glob.glob(all_dir + '*/*')
pruned_paths = glob.glob(pruned_dir + '*/*')
more_files = [fp.split('/')[-1] for fp in more_paths]
pruned_files = [fp.split('/')[-1] for fp in pruned_paths]
desc_paths = glob.glob(description_dir + '*/*')
desc_files = [fp.split('/')[-1] for fp in desc_paths]


pruned_nums = set([get_img_n(name) for name in pruned_files])
all_nums = set([get_img_n(name) for name in more_files])
rec_nums = set([get_img_n(name) for name in diagram_image_names])
desc_nums = set([get_img_n(name) for name in desc_files])

removed_images = all_nums.difference(pruned_nums.union(desc_nums))

removed_image_names = []
for img_n in removed_images:
    for image_name in more_files:
        if img_n == get_img_n(image_name):
            removed_image_names.append(image_name)

name_change_lookup = {}
for image_name in more_files:
    img_n = get_img_n(image_name)
    for newer_name in pruned_files:
        if img_n == get_img_n(newer_name) and newer_name != image_name:
            name_change_lookup[image_name] = newer_name

removed_image_names = sorted(removed_image_names)

In [17]:
blacklisted_topics = ['periodic_table', 'em_spectrum', 'hydrocarbons', 'geologic_time'] + ['lewis_dot_idapgrams', 'circuits']  # correct this mispelling in future round

In [18]:
len(removed_image_names)

9

# Add image annotations

## localization

In [19]:
loc_res_df.head(1)

Unnamed: 0,diagram,rectangle,hit_id,assignment_id,worker_id
0,parts_cell_1182.png,"[[283, 192], [447, 238]]",3SA4EMRVJV39U1MGLCYP6KPFULH0PX,3BDCF01OGXVJNV1XRULS5F5Z4B6LYG,A1017VP86SLXRB


In [20]:
loc_anno = clean_list(os.path.join(turk_proc_dir, box_loc_joined))
loc_anno_images = [fig.split('.json')[0]  for fig in loc_anno]
keep_figures = [fig for fig in loc_anno_images if fig not in removed_image_names]

loc_box_path = os.path.join(turk_proc_dir, box_loc_joined)

diag_loc_annotations = {}
for diagram_name in keep_figures:
    anno_file_path = os.path.join(loc_box_path, diagram_name + '.json')
    if not os.path.exists(anno_file_path):
        diagram_name = diagram_name.replace('optics_rays', 'optics_ray_diagrams')
        anno_file_path = os.path.join(loc_box_path, diagram_name  + '.json')
    with open(anno_file_path, 'r') as f:
        diag_loc_annotations[diagram_name] = json.load(f)

combined_master_file_list = pruned_files + desc_files
combined_master_file_list_whitelisted = [file for file in combined_master_file_list if file in keep_figures]
files_still_needing_localisation = sorted(list(set(combined_master_file_list).difference(set(diag_loc_annotations))))
len(files_still_needing_localisation)

142

## recognition

### code

In [21]:
def most_common_strict(image_response):
    """
    returns the consensus response of the three raw response strings for a given image
    """
    most_common = image_response[1]['raw_text'].mode()
    if most_common.empty:
        most_common = 'nonconsensus'
        noncon.append(image_response[1]['raw_text'])
    else:
        most_common = most_common.values[0]
    return most_common

def most_common_lax(image_response, strings_denoting_missing_image=[]):
    """
    returns the consensus response after stripping white space and converting the reponses to lower case
    """
    simple_sanitizer = lambda x : x.lower().strip().lstrip()
    ind_responses = image_response[1]['raw_text'].values
    probobly_blanks = [response for response in ind_responses if response in strings_denoting_missing_image]
    if probobly_blanks:
        return 'skip'
    most_common = image_response[1]['raw_text'].apply(simple_sanitizer).mode()
    if most_common.empty:
        most_common = 'no consensus'
        noncon[image_response[0][0]].extend(image_response[1]['raw_text'])
    else:
        most_common = most_common.values[0]
    return most_common

def find_transcriptions_matches(batch_results_df, response_matcher):
    """
    returns a pandas series with the consunsus response for each image
    """
    agreed_responses = pd.DataFrame()
    for image_response in batch_results_df.groupby(['diagram', 'box_diag_idx']):
        diagram_and_idx = image_response[0]
        most_common = response_matcher(image_response, strings_denoting_missing_image=[])
        if most_common == 'skip':
            continue
        this_row = pd.DataFrame(list(diagram_and_idx) + [most_common, image_response[1]['rectangle'].iloc[0], image_response[1]['assignment_id'].iloc[0]]).T
        agreed_responses = pd.concat([agreed_responses, this_row])
        # The reindex below is needed to match the original df index after the groupby operation
    agreed_responses.columns = ['diagram', 'box_diag_idx', 'consensus_res', 'rectangle', 'assignment_id']
    return agreed_responses

### run

In [22]:
recog_performed_on = set(pd.unique(recog_res_df['diagram']).tolist())
len(recog_performed_on)

2190

In [24]:
files_still_needing_recognition = sorted(list(set(pruned_files).difference(set(recog_performed_on))))
print(len(files_still_needing_recognition))
file_with_loc_no_recog = set(files_still_needing_recognition).difference(files_still_needing_localisation)
print(len(file_with_loc_no_recog))

131
7


In [28]:
noncon = defaultdict(list)
transcription_results_lax = find_transcriptions_matches(recog_res_df, most_common_lax)

In [29]:
noncon_entries = [entries for entries in noncon.values()]
flattened_noncon = [item for sublist in noncon_entries for item in sublist]

In [30]:
curated_no_image_strings = set(['*no image showing*', '', ' ', 'NA', '?', 'na', '0', 'No image found', 'blank', 'Nothing showing', "where is the images , i can't see anything", 'NO IMAGE', ''])

In [31]:
non_blank_no_consensus = {d_name: rec_res for d_name, rec_res in noncon.items() if not curated_no_image_strings.intersection(set(rec_res))}
blank_no_consensus = {d_name: rec_res for d_name, rec_res in noncon.items() if curated_no_image_strings.intersection(set(rec_res))}
print(len(non_blank_no_consensus))
print(len(blank_no_consensus))

408
99


In [32]:
flattened_noncon_no_blank = [item for sublist in non_blank_no_consensus.values() for item in sublist]
build_diagram_rec_corpus  = [words.split() for words in transcription_results_lax['consensus_res'].values.tolist()]
diagram_rec_corpus = set([item.lower().strip() for sublist in build_diagram_rec_corpus for item in sublist if item.isalpha() and len(item) > 3])

### hide

In [151]:
# strings_denoting_missing_image = list(pd.Series(flattened_noncon).value_counts()[:20].index)
# Image.open('../ai2-vision-textbook-dataset/diagrams/turk_data/optics_ray_diagrams_9170.png')

In [163]:
len(diagram_rec_corpus)

4073

# Integrate diagram questions and descriptions

## match diagram topics to lessons

first need to match diagram topics to flexbook lessons

### code

In [33]:
def make_topic_matches(topic_list, combined_topics):
    topic_matches = {}
    for diagram_topic in topic_list:
        topic_matches[diagram_topic] = []
        for terms in topic_term_match[diagram_topic]:
            lev_dist_threshed = [topic for topic in combined_topics.keys() if fuzz.ratio(topic, terms) > 85]
            topic_matches[diagram_topic] += lev_dist_threshed
        if not topic_matches[diagram_topic]:
                for terms in topic_term_match[diagram_topic]:
                    lev_dist_threshed = [topic for topic in combined_topics.keys() if fuzz.token_set_ratio(topic, terms) > 80]
                    topic_matches[diagram_topic] += lev_dist_threshed
    return topic_matches

def make_lesson_matches(ck12_dataset, diagram_topic_name, topic_matches):
    lesson_matches = defaultdict(list)
    lessons_seen = set()
    content_topics =  topic_matches[diagram_topic_name]
    for topic in sorted(content_topics):
        associated_lesson =combined_topics[topic]['lesson']
        if associated_lesson not in lessons_seen:
            lessons_seen.add(associated_lesson)
            lesson_matches[diagram_topic_name].append(associated_lesson)
    return dict(lesson_matches)

### run

The pruned directory is the tqa 0.91 set assmbled by Ani on Sept 27th. It should be treated as definitive

In [34]:
diagram_topic_list = clean_list(pruned_dir)

In [35]:
es_lesson_names = [item for sublist in [val['topics'].keys() for val in ck12_combined_dataset['earth-science'].values()] for item in sublist]
ps_lesson_names = [item for sublist in [val['topics'].keys() for val in ck12_combined_dataset['physical-science'].values()] for item in sublist]
ls_lesson_names = [item for sublist in [val['topics'].keys() for val in ck12_combined_dataset['life-science'].values()] for item in sublist]

combined_lessons = es_lesson_names + ps_lesson_names + ls_lesson_names
topic_series = pd.Series(combined_lessons).value_counts()
# the 17 here found by inspection- any "topic" appearing many times is something general like review, vocab, etc
topics_to_remove = list(topic_series[:17].index)

In [36]:
combined_topics = defaultdict(dict)
for subject, book in ck12_combined_dataset.items():
    for lesson, material in book.items():
        for topic, text in material['topics'].items():
            if topic in topics_to_remove:
                continue
            combined_topics[topic.lower()]['lesson'] = lesson

In [37]:
topic_matches = make_topic_matches(diagram_topic_list, combined_topics)
missing= []
for k, v in topic_matches.items():
    if not v:
        missing.append(k)

In [38]:
matching_lessons = {}
for topic in diagram_topic_list:
    matched_lessons = make_lesson_matches(ck12_combined_dataset, topic, topic_matches)
    matching_lessons.update(matched_lessons)

In [39]:
diagram_lesson_lookup = {}
for d_topic, lessons in matching_lessons.items():
    diagram_lesson_lookup[d_topic] = sorted(lessons)[0]

In [40]:
#manually correct name changes made since diagrams were assembled
diagram_lesson_lookup['lewis_dot_diagrams'] = diagram_lesson_lookup['lewis_dots']
diagram_lesson_lookup['optics_ray_diagrams'] = diagram_lesson_lookup['optics_rays']

### hide

In [41]:
lessons_seen = []
dupe_lessons = []
for k, v in diagram_lesson_lookup.items():
    if v not in lessons_seen:
        lessons_seen.append(v)
    else:
        dupe_lessons.append(v)

In [45]:
dupe_topics = defaultdict(list)
for k, v in diagram_lesson_lookup.items():
    if v in dupe_lessons:
        dupe_topics[v].append(k)
# dupe_topics

In [46]:
missing

[]

In [43]:
len(diagram_lesson_lookup.keys())

len(set(diagram_lesson_lookup.values()))

for k, v in sorted(matching_lessons.items()):
    print(k)
    print(sorted(v))
    print()

acid_rain_formation
['22.2 Effects of Air Pollution']

aquifers
['13.3 Groundwater']

atomic_mass_number
['5.1 Inside the Atom', 'atomic number', 'matter mass and volume']

atomic_structure
['5.1 Inside the Atom']

biomes
['climate zones and biomes']

blastocyst
['22.3 Reproduction and Life Stages']

cell_division
['5.1 Cell Division']

cellular_respiration
['9.4 Biochemical Reactions']

chemical_bonding_covalent
['7.3 Covalent Bonds']

chemical_bonding_ionic
['7.2 Ionic Bonds']

circuits
['23.3 Electric Circuits']

continental_drift
['6.2 Continental Drift']

convection_of_air
['15.2 Energy in the Atmosphere']

cycle_carbon
['18.2 Cycles of Matter']

cycle_nitrogen
['18.2 Cycles of Matter']

cycle_rock
['4.1 Types of Rocks', 'rocks and processes of the rock cycle']

cycle_water
['24.2 Cycles of Matter']

dna
['nucleic acid classification']

earth_day_night
['rotation of earth']

earth_eclipses
['24.4 The Sun and the EarthMoon System']

earth_magnetic_field
['earth as a magnet']

earth

In [190]:
# pprint.pprint(dict(dupe_topics))

## merge questions

### code

In [48]:
dq_image_folder = 'diagram-question-images/'
td_image_folder = 'diagram-teaching-images/'

def make_question_entry(qdf_row):
    ask = qdf_row[qdf_row.index == '03_write_question'].values[0]
    answer = qdf_row[qdf_row.index == '04_write_right_answer'].values[0]
    wrong_answers = qdf_row[qdf_row.index == 'wac_list'].values[0]
    q_topic = qdf_row[qdf_row.index == 'lesson_assigned_to'].values[0]
    image_uri = qdf_row[qdf_row.index == 's3_uri'].values[0]
    image_name = qdf_row[qdf_row.index == 'diagram'].values[0]
    
    def make_answer_choices(answer_choices):
        build_answer_choices = {}
        letter_options = list('abcd')
        random.shuffle(answer_choices)
        for idx, answer_choice in enumerate(answer_choices):
            answer_choice_dict = {
                "idStructural": letter_options[idx] + '.',
                "rawText": answer_choice,
                "processedText": answer_choice
            }
            build_answer_choices[letter_options[idx]] = answer_choice_dict
        return build_answer_choices
    a_choices = make_answer_choices(wrong_answers + [answer])
    single_q_dict = {
        "id": 'q',
        "type": 
            "Diagram Multiple Choice",
        "beingAsked": {
            "rawText": ask,
            "processedText": ask.encode('ascii', 'ignore').decode('utf-8')
        },
        "correctAnswer": {
            "rawText": answer,
            "processedText": answer.encode('ascii', 'ignore').decode('utf-8')
        },
        "answerChoices": a_choices,
        "imageUri": image_uri,
        "imageName": image_name
    }
    build_questions[q_topic].append(single_q_dict)
    
    
def refine_question_formats(raw_questions):
    reformatted_dq_ds = {}
    for topic, topic_questions in raw_questions.items():
        reformatted_topic = {topic: {'questions': {'diagramQuestions': {}}}}
        reformatted_questions = {}
        for idx, question in enumerate(topic_questions):
            question = deepcopy(question)
            question['id'] += str(idx + 1).zfill(4)
            reformatted_questions[question['id']] = question
        reformatted_topic[topic]['questions']['diagramQuestions'] = reformatted_questions
        reformatted_dq_ds.update(reformatted_topic)
    return reformatted_dq_ds

s3_base = 'https://s3.amazonaws.com/ai2-vision-textbook-dataset/diagrams/' + dq_image_folder
s3_base_descriptions = 'https://s3.amazonaws.com/ai2-vision-textbook-dataset/diagrams/' + td_image_folder

def make_image_link(old_url, s3_base=s3_base):
    image_name = old_url.split('/')[-1]
    new_url = s3_base + image_name
    return new_url

### run

In [121]:
dr_proc_df['s3_uri'] = dr_proc_df['reference_id'].apply(make_image_link)
dr_proc_df['lesson_assigned_to'] = dr_proc_df['topic'].apply(lambda x: diagram_lesson_lookup[x])

In [118]:
build_questions = defaultdict(list)
_ = dr_proc_df.apply(make_question_entry, axis=1)

In [119]:
refined_questions = refine_question_formats(build_questions)

In [120]:
for subject, lessons in ck12_combined_dataset.items():
    for l_name, lesson in lessons.items():
        if l_name in refined_questions.keys():        
            lesson['questions']['diagramQuestions'] = refined_questions[l_name]['questions']['diagramQuestions']

### hide

In [63]:
refined_questions = dict(refine_question_formats(build_questions))

refined_questions['10.4 Erosion and Deposition by Glaciers']['questions'].keys()

dict_keys(['diagramQuestions'])

In [64]:
refined_questions['10.4 Erosion and Deposition by Glaciers']

len(ck12_combined_dataset['earth-science']['10.4 Erosion and Deposition by Glaciers']['questions']['diagramQuestions'])

len(ck12_combined_dataset['earth-science']['10.4 Erosion and Deposition by Glaciers']['questions']['nonDiagramQuestions'])

val_counts=dr_proc_df['lesson_assigned_to'].value_counts()

In [65]:
val_counts

10.1 Introduction to Plants                     1146
24.1 Flow of Energy                             1006
3.2 Cell Structures                              916
12.4 Insects and Other Arthropods                719
17.3 The Digestive System                        569
24.4 The Sun and the EarthMoon System            389
24.1 Planet Earth                                361
6.1 Inside Earth                                 347
20.1 The Nervous System                          337
19.1 The Respiratory System                      291
10.2 Evolution and Classification of Plants      281
22.3 Vision                                      267
8.3 Types of Volcanoes                           213
25.1 Introduction to the Solar System            197
22.2 Optics                                      195
11.3 Nuclear Energy                              192
18.1 Overview of the Cardiovascular System       181
4.2 Photosynthesis                               177
14.2 Ocean Movements                          

## merge descriptions

In [122]:
def make_description_entry(qdf_row):
    description = qdf_row[qdf_row.index == 'Description'].values[0]
    q_topic = qdf_row[qdf_row.index == 'lesson_assigned_to'].values[0]
    image_uri = qdf_row[qdf_row.index == 's3_uri'].values[0]
    image_name = qdf_row[qdf_row.index == 'diagram'].values[0]
    image_key = image_name.replace('.png', '')
    single_desc_dict = {
        "imageUri": image_uri,
        "imageName": image_name,
        "rawText": description,
        "processedText": description.encode('ascii', 'ignore').decode('utf-8')
        }
    if image_key not in build_descriptions[q_topic].keys():
        build_descriptions[q_topic].update({image_key: single_desc_dict})
    # I've found the longest description is usually best
    elif len(single_desc_dict['processedText']) > len(build_descriptions[q_topic][image_key]['processedText']):
        build_descriptions[q_topic].update({image_key: single_desc_dict})

In [123]:
%%capture
ai2_written_df_completed['lesson_assigned_to'] = ai2_written_df_completed['topic'].apply(lambda x: diagram_lesson_lookup[x])
ai2_written_df_completed['s3_uri'] = ai2_written_df_completed['Image Path'].apply(make_image_link)
ai2_written_df_completed = ai2_written_df_completed.dropna()

desc_df['topic'] = desc_df['diagram'].apply(lambda x: x.rsplit('_', maxsplit=1)[0])
desc_df['lesson_assigned_to'] = desc_df['topic'].apply(lambda x: diagram_lesson_lookup[x])
desc_df['s3_uri'] = desc_df['reference_id'].apply(make_image_link)
desc_df['Description'] = desc_df['01_write_description']             

In [124]:
build_descriptions = defaultdict(dict)
_ = desc_df.apply(make_description_entry, axis=1)
_ = ai2_written_df_completed.apply(make_description_entry, axis=1)

In [125]:
# this adds the descriptions to the combined dataset
for subject, lessons in ck12_combined_dataset.items():
    for l_name, lesson in lessons.items():
        if l_name in build_descriptions.keys():
            lesson['instructionalDiagrams'] = build_descriptions[l_name]
        else:
            lesson['instructionalDiagrams'] = {}

### hide

In [233]:
pd.unique(desc_df['lesson_assigned_to']).shape

(84,)

In [183]:
build_descriptions.keys()

dict_keys(['5.1 Cell Division', '14.1 Introduction to the Oceans', '18.2 Cycles of Matter', '20.1 The Nervous System', 'nucleic acid classification', '9.2 Soils', '17.1 Climate and Its Causes', '12.5 Echinoderms and Invertebrate Chordates', '24.2 Cycles of Matter', 'radioactive decay as a measure of age', '8.3 Types of Volcanoes', '19.2 The Excretory System', '1.4 The Microscope', '16.3 The Skeletal System', '9.1 Protists', '14.3 The Ocean Floor', '4.1 Solids Liquids Gases and Plasmas', '6.1 Inside Earth', '10.1 Erosion and Deposition by Flowing Water', 'earth as a magnet', '16.3 Simple Machines', '10.2 Evolution and Classification of Plants', '9.2 Fungi', 'flatworms', '5.1 Inside the Atom', 'rotation of earth', '21.3 First Two Lines of Defense', 'nails and hair', '9.4 Biochemical Reactions', 'clouds', '10.4 Erosion and Deposition by Glaciers', '25.1 Introduction to the Solar System', 'blood vessels', '25.2 Using Electromagnetism', '19.1 The Respiratory System', 'faults', '13.2 Fish', 

In [185]:
# with open(output_dir + 'ck12_dataset_beta_v4.json', 'w') as f:
#     json.dump(ck12_combined_dataset, f, indent=4, sort_keys=True)

In [186]:
# with open(output_dir + 'ck12_dataset_beta_v4.json', 'r') as f:
#     ck12_combined_dataset = json.load(f)

## Apply spelling and grammar fixes

### code

In [66]:
def check_mispelled(word):
    return word and word.isalpha() and not (edict.check(word) or anglo_edict.check(word) or edict.check(word[0].upper() + word[1:]))

def correct_spelling_error(misspelled_word, suggested_spellings):
    highest_ratio = 0
    closest_match = None
    for word in suggested_spellings:
        match_r = fuzz.ratio(misspelled_word, word)
        if match_r >= highest_ratio and (word[0] == misspelled_word[0] or not check_mispelled(word[0] + misspelled_word)) and len(misspelled_word) <= len(word):
            highest_ratio = match_r
            closest_match = word
            break
    spell_changes[misspelled_word] = closest_match
    return closest_match

def apply_spelling_fix(orig_text):
    orig_text_tokens = wordpunct_tokenize(orig_text)
    processed_tokens = []
    for token in orig_text_tokens:
        norm_token = token.lower()
        if len(norm_token) < 4:
            processed_tokens.append(token)
            continue
        if check_mispelled(norm_token):
            suggested_replacements = edict.suggest(token)
            replacement_text = correct_spelling_error(norm_token, suggested_replacements)
            if replacement_text:
                if norm_token[0].isupper():
                    replacement_text = upper(replacement_text[0]) + replaced_text[1:]
                processed_tokens.append(replacement_text)
            else:
                processed_tokens.append(token)
        else:
            processed_tokens.append(token)
    return ' '.join(processed_tokens)

def diff_corrected_text(orig_text, corrected_text):
    diff = dmp.diff_main(orig_text, corrected_text)
    return HTML(dmp.diff_prettyHtml(diff))

def specify_lesson_q_path(lesson):
    pass
    

def apply_spelling_and_grammar_to_ds(ck12_ds):
    return

### run

In [126]:
dmp = diff_match_patch.diff_match_patch()

In [127]:
ck12_spell_gramm_fix_test = deepcopy(ck12_combined_dataset)

In [128]:
gram_checker = language_check.LanguageTool('en-US')
gram_checker.disabled = set(['SENT_START_CONJUNCTIVE_LINKING_ADVERB_COMMA', 'POSSESSIVE_APOSTROPHE', 'A_PLURAL'])
gram_checker.disable_spellchecking()

punc_set_space = set([',', ':', ';', '/"'])
punc_set_nospace = set(['-', '\'', '-', '?', '.', '!'])
question_enders = set(['.', '?', ':'])

In [129]:
#check descriptions
spell_changes = {}
unaltered_text = []
replaced_text = []
for lesson in list(ck12_spell_gramm_fix_test['life-science'].values()):
    if lesson['instructionalDiagrams']:
        for diagram, description in lesson['instructionalDiagrams'].items():
            orig_text = description['processedText']
            spell_fixed_text = apply_spelling_fix(orig_text)
            for punc_char in punc_set_nospace:
                spell_fixed_text = spell_fixed_text.replace(' ' + punc_char + ' ' , punc_char)
            for punc_char in punc_set_space:
                spell_fixed_text = spell_fixed_text.replace(' ' + punc_char + ' ' , punc_char + ' ')
            gram_fixed = gram_checker.correct(spell_fixed_text)
            if gram_fixed != orig_text:
                unaltered_text.append(orig_text)
                replaced_text.append(gram_fixed)

In [2603]:
#check diagram questions
spell_changes = {}
unaltered_text = []
replaced_text = []
for lesson in list(ck12_spell_gramm_fix_test['life-science'].values()):
    if lesson['questions']['nonDiagramQuestions']:
        for diagram, description in lesson['questions']['DiagramQuestions'].items():
            orig_text = description['beingAsked']['processedText']
            spell_fixed_text = apply_spelling_fix(orig_text)
            gram_fixed = gram_checker.correct(spell_fixed_text)
            for punc_char in punc_set_nospace:
                gram_fixed = gram_fixed.replace(' ' + punc_char + ' ' , punc_char)
                gram_fixed = gram_fixed.replace(' ' + punc_char, punc_char)
            for punc_char in punc_set_space:
                gram_fixed = gram_fixed.replace(' ' + punc_char + ' ' , punc_char + ' ')
            if gram_fixed[-1] not in question_enders:
                if gram_fixed.split()[0] in ['Identify', 'Name'] or '__' in gram_fixed:
                    gram_fixed += '.'
                else:
                    gram_fixed += '?'
            if gram_fixed != orig_text:
                unaltered_text.append(orig_text)
                replaced_text.append(gram_fixed)

In [130]:
comp_text = list(zip(unaltered_text, replaced_text))

In [131]:
print(len(spell_changes))
print(len(comp_text))
# spell_changes

18
65


In [132]:
rand_idx = np.random.randint(len(comp_text))
print(unaltered_text[rand_idx])
print()
print(replaced_text[rand_idx])
diff_corrected_text(*comp_text[rand_idx])

The diagram shows the circulatory system. It is the system that circulates blood and lymph through the body consisting of the heart, blood vessels, blood, lymph, and the lymphatic vessels and glands. Arterial circulation is the part of your circulatory system that involves arteries, like the aorta and pulmonary arteries. Arteries are blood vessels that carry blood away from your heart. (The exception is the coronary arteries, which supply your heart muscle with oxygen-rich blood.) Venous circulation is the part of your circulatory system that involves veins, like the vena cavae and pulmonary veins. Veins are blood vessels that carry blood to your heart. Veins have thinner walls than arteries.

The diagram shows the circulatory system. It is the system that circulates blood and lymph through the body consisting of the heart, blood vessels, blood, lymph, and the lymphatic vessels and glands. Arterial circulation is the part of your circulatory system that involves arteries, like the aort

### hide

In [304]:
# with open(output_dir + 'ck12_dataset_beta_v4.json', 'r') as f:
#     ck12_combined_dataset = json.load(f)

# Topic key collisions

In [133]:
flexbook_ds.keys()

dict_keys(['physical-science', 'earth-science', 'life-science'])

In [134]:
build_website_lessons = [list(lesson.keys()) for lesson in lessons_ds.values()]
website_lessons= sorted([item for sublist in build_website_lessons for item in sublist])

build_flexbook_lessons = [list(lesson.keys()) for lesson in flexbook_ds.values()]
flexbook_lessons= [item for sublist in build_flexbook_lessons for item in sublist]
flexbook_lessons = sorted([lesson.split(maxsplit=1)[1].strip().lower() for lesson in flexbook_lessons])
fbls = set(flexbook_lessons)
wsls = set(website_lessons)

In [135]:
len(flexbook_lessons)

247

In [136]:
print(len(flexbook_lessons))
print(len(set(flexbook_lessons)))

247
243


In [137]:
print(len(website_lessons))
print(len(set(website_lessons)))

829
829


In [97]:
len(set(website_lessons).union(set(flexbook_lessons)))

1024

# Refinements to make

### todo

global ids

make all abc questions upper/lower case

add vocab defs

punctuation/ text normalization

maybe spell and grammar on entire set

linking lessons

image_uris now point to local dir instead of s3

remove ck12 lesson numbers from keys and image names

### develop

In [251]:
#print(topics_to_remove) #specicied in match topics section above, explictly set here

In [252]:
structural_topics = ['Summary', 'Review', 'References', 'Explore More', 'Lesson Summary', 'Lesson Objectives', 'Points to Consider', 'Introduction',
                    'Recall', 'Apply Concepts', 'Think Critically', 'Resources', 'Explore More II', 'Explore More I', 'Explore More III']

vocab_topics = ['Lesson Vocabulary', 'Vocabulary']

In [295]:
def iterate_over_all_material(complete_ds):
    dq_gid = 0
    ndq_gid = 0
    for subject, lessons in complete_ds.items():
        for lesson_name, lesson_content in lessons.items():
            struct_topics = iterate_over_text(lesson_content['topics'])
#             lesson_content['adjunctTopics'] = struct_topics
            iterate_over_text_questions(lesson_content['questions']['nonDiagramQuestions'])
            if lesson_content['instructionalDiagrams']:
                if not lesson_content['questions']['diagramQuestions']:
                    print(lesson_name + ' missing questions')
                iterate_over_diagram_questions(lesson_content['questions']['diagramQuestions'])
                iterate_over_diagram_descriptions(lesson_content['instructionalDiagrams'])
            
def iterate_over_text(topic_sections):
    structural_content = {}
    for topic, content in topic_sections.items():
        if topic in vocab_topics:
            add_defintions_to_vocab(content)
        elif topic in structural_topics:
            structural_content[topic] = content
    return structural_content
    
def add_defintions_to_vocab(vocab_section):
#     print('adding')
    pass

def iterate_over_text_questions(text_questions):
        for qid, question in text_questions.items():
            add_global_ids(question, 'text')
            
def iterate_over_diagram_questions(diagram_questions):
        for qid, question in diagram_questions.items():
            add_global_ids(question, 'diagram')
            replace_uri_with_path(question, 'question_images')
            if detect_abc_question(question):
                standardize_abc_question(question)

def iterate_over_diagram_descriptions(diagram_descriptions, description_path_prefix=None):
    for diagram_name, diagram_content in diagram_descriptions.items():
        replace_uri_with_path(diagram_content, 'teaching_images')
    pass

def add_global_ids(global_count, which_counter):
#     print('add ids')
    pass

def detect_abc_question(question):
    return False

def standardize_abc_question(question):
    pass

def replace_uri_with_path(image_content, path_prefix):
    pass
#     image_content.pop('imageUri')
#     image_content['imagePath'] = None
    

In [296]:
iterate_over_all_material(test_cds)

In [267]:
list(test_cds['earth-science'].values())[0].keys()

dict_keys(['instructionalDiagrams', 'questions', 'topics', 'hidden', 'adjunctTopics'])

In [269]:
# list(test_cds['earth-science'].values())[0]['adjunctTopics']

In [263]:
# test_cds = deepcopy(ck12_combined_dataset)

In [None]:
#             print(lesson_content['topics'].keys())

# Splitting experiments

# End

In [207]:
# flexbook_glossary.keys()

In [163]:
# print(list(diagram_lesson_lookup.values()))

In [160]:
test_lesson = ck12_combined_dataset['earth-science']['24.1 Planet Earth']

In [162]:
# pprint.pprint(test_lesson['instructionalDiagrams'])

In [164]:
list(test_lesson['instructionalDiagrams'].values())[0]['processedText']

"This Diagram shows the Earth's rotation. Which is the amount of time that it takes to rotate once on its axis. This is, apparently, accomplished once a day  every 24 hours. However, there are actually two different kinds of rotation that need to be considered here. For one, theres the amount of time it take for the Earth to turn once on its axis so that it returns to the same orientation compared to the rest of the Universe. Then theres how long it takes for the Earth to turn so that the Sun returns to the same spot in the sky. Earth's rotation is slowing slightly with time; thus, a day was shorter in the past. This is due to the tidal effects the Moon has on Earth's rotation. Atomic clocks show that a modern-day is longer by about 1.7 milliseconds than a century ago, slowly increasing the rate at which UTC is adjusted by leap seconds."

In [165]:
test_lesson['topics'].keys()

dict_keys(['Earths Rotation', 'Earths Gravity', 'Earths Shape Size and Mass', 'Lesson Summary', 'Introduction', 'Lesson Objectives', 'Earths Magnetism', 'Recall', 'Points to Consider', 'Think Critically', 'Earths Revolution', 'Apply Concepts', 'Vocabulary', 'Earths Motions', 'Earths Day and Night', 'Earths Seasons'])

In [201]:
test_vocab = test_lesson['topics']['Vocabulary']['content']['text'].split('\n')

In [205]:
for word in test_vocab:
    if word in flexbook_glossary:
        print(flexbook_glossary[word])
        print()

an imaginary line that runs from the North Pole to South Pole through the center of Earth 

highest level of organization in ecology that includes all the parts of Earth where life can be found and consists of all the worlds biomes, both terrestrial and aquatic 

As traditionally defined, force of attraction between two masses. 

one half of a sphere 

all the water on Earth 

Area around a magnet where it exerts magnetic force. 



In [166]:
# write_file('ck12_v4_5.json', ck12_combined_dataset, 'experimental_output')