# Table of Contents
* [Load data](#Load-data)
* [dataset validation](#dataset-validation)
	* [schema validation](#schema-validation)
	* [other validation test](#other-validation-test)
* [exploring dataset](#exploring-dataset)
	* [topic names](#topic-names)
	* [question type dist](#question-type-dist)
	* [looking for missing values](#looking-for-missing-values)
* [html viz](#html-viz)
* [End](#End)


In [424]:
%%capture
import numpy as np
import pandas as pd
import scipy.stats as st
import itertools
import math
from collections import Counter, defaultdict, OrderedDict
%load_ext autoreload
%autoreload 2

import cv2
import pprint
import pickle
import json
import requests
import io
import sys
import os
from binascii import b2a_hex
import base64
from wand.image import Image as WImage
from IPython.display import display
import PIL.Image as Image
from copy import deepcopy
import glob

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.layout import LAParams, LTTextBox, LTTextLine, LTFigure, LTImage

import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

import pdfextraction.ck12_flex_extract as ck_ex

# Load data

__pieces from flexbooks and website lessons, kept separate__

In [444]:
with open('ck12_flexbook_only_beta_v1.json', 'r') as f:
    flexbook_ds = json.load(f)
with open('ck12_lessons_only_beta_v1.json', 'r') as f:
    lessons_ds = json.load(f)

__combined dataset__

Load the saved combined file, or assemble it from the new pieces (the commented-out line).

In [445]:
# ck12_combined_dataset = {k: dict(v, **flexbook_ds[k]) for k, v in lessons_ds.items()}
with open('ck12_dataset_beta_v1.json', 'r') as f:
    ck12_combined_dataset = json.load(f)
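
The commented-out line above merges the two pieces subject by subject. A minimal sketch of the same idea follows, run on toy data rather than the real files. The function name `combine_pieces` and the toy lesson names are hypothetical. Unlike the one-liner, this version takes the union of subjects, so a subject present in only one piece does not raise a `KeyError`.

```python
def combine_pieces(lessons, flexbooks):
    """Merge two {subject: {lesson_name: lesson}} mappings subject by subject."""
    combined = {}
    for subject in set(lessons) | set(flexbooks):
        merged = dict(lessons.get(subject, {}))    # shallow copy of the website lessons
        merged.update(flexbooks.get(subject, {}))  # flexbook entries win on a name clash
        combined[subject] = merged
    return combined


# Toy stand-ins for the two JSON pieces (hypothetical lesson names).
toy_lessons = {'earth-science': {'erosion': {'source': 'website'}}}
toy_flexbooks = {
    'earth-science': {'1.1 the nature of science': {'source': 'flexbook'}},
    'life-science': {'2.1 cells': {'source': 'flexbook'}},
}

toy_combined = combine_pieces(toy_lessons, toy_flexbooks)
print(sorted(toy_combined))
```

The merge is shallow: lessons are shared between the inputs and the result, not copied.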

# dataset validation

The topographic map lesson is thrown off by a missing diagram.
The remaining warnings seem to be genuine `unexpected_content`.

In [450]:
ds_assembler = ck_ex.CK12DataSetAssembler()
ds_assembler.validate_dataset(lessons_ds)

validating schema for earth-science





validating schema for life-science


u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short



validating schema for physical-science


u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
u'' is too short
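
The repeated message does not say which field is empty. It looks like the kind of message a JSON Schema validator reports for a string shorter than its minimum length, but that is an assumption about what `validate_dataset` does internally. A minimal sketch for locating empty strings in a nested lesson follows. The helper `find_empty_strings` is hypothetical, and the toy lesson only mimics the question structure printed later in this notebook.

```python
def find_empty_strings(node, path=()):
    """Yield the key path of every empty string inside nested dicts and lists."""
    if isinstance(node, dict):
        for key in sorted(node):
            for hit in find_empty_strings(node[key], path + (key,)):
                yield hit
    elif isinstance(node, list):
        for index, item in enumerate(node):
            for hit in find_empty_strings(item, path + (index,)):
                yield hit
    elif node == '':
        yield path


# Toy lesson with one deliberately empty answer.
toy_lesson = {
    'questions': {'nonDiagramQuestions': {
        'q01': {
            'beingAsked': {'processedText': 'how fast an object is moving is its _____.'},
            'correctAnswer': {'processedText': ''},  # empty on purpose
        },
    }},
}

empty_paths = list(find_empty_strings(toy_lesson))
for found in empty_paths:
    print(' -> '.join(str(part) for part in found))
```

Running this over each lesson in `lessons_ds` would show whether the empty strings cluster in one field.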





# exploring dataset

## topic names

In [383]:
es_lesson_names = [item for sublist in [val['topics'].keys() for val in ck12_combined_dataset['earth-science'].values()] for item in sublist]
ps_lesson_names = [item for sublist in [val['topics'].keys() for val in ck12_combined_dataset['physical-science'].values()] for item in sublist]
ls_lesson_names = [item for sublist in [val['topics'].keys() for val in ck12_combined_dataset['life-science'].values()] for item in sublist]

topic_series = pd.Series(es_lesson_names + ps_lesson_names + ls_lesson_names).value_counts()
topic_series[:18]

Summary                    751
Review                     748
References                 703
Explore More               483
Lesson Summary             250
Lesson Review Questions    248
Lesson Objectives          248
Introduction               246
Points to Consider         246
Recall                     245
Apply Concepts             243
Think Critically           243
Resources                  221
Lesson Vocabulary          151
Vocabulary                 101
Explore More II             94
Explore More I              93
Explore More III            17
dtype: int64

In [387]:
topic_series[18:40]

Climate                       6
Photosynthesis                6
Nuclear Energy                5
Carbohydrates                 5
Proteins                      5
Lipids                        5
Habitat                       5
External Resources            4
Earths Gravity                4
Food Webs                     4
Conduction                    4
Index Fossils                 4
Formation of Fossil Fuels     4
Acid Rain                     4
Whats the Matter              4
Transform Plate Boundaries    4
How Fossils Form              4
Formation                     4
The Carbon Cycle              4
Plate Tectonics               3
Conserving Energy             3
Practice                      3
dtype: int64
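
The three per-subject list comprehensions above can be collapsed into one pass over all subjects. A sketch using only the standard library follows, on toy data with hypothetical lesson names. `count_topic_names` is not part of the notebook.

```python
from collections import Counter
from itertools import chain


def count_topic_names(dataset):
    """Count how often each topic name occurs across every lesson of every subject."""
    lessons = chain.from_iterable(book.values() for book in dataset.values())
    # Iterating over a lesson's 'topics' dict yields its topic names.
    return Counter(chain.from_iterable(lesson['topics'] for lesson in lessons))


# Toy dataset in the same {subject: {lesson: {'topics': {...}}}} shape.
toy_dataset = {
    'earth-science': {'lesson a': {'topics': {'Summary': {}, 'Review': {}, 'Climate': {}}}},
    'life-science': {'lesson b': {'topics': {'Summary': {}, 'Habitat': {}}}},
}

topic_counts = count_topic_names(toy_dataset)
print(topic_counts.most_common(1))
```

`Counter.most_common(n)` plays the role of slicing the sorted `value_counts()` series.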

## question type dist

In [376]:
q_types = []
for subject, flexbook in ck12_combined_dataset.items():
    for lesson in flexbook.values():
        for question in lesson['questions']['nonDiagramQuestions'].values():
            q_types.append(question['type'])
question_counts = pd.Series(q_types).value_counts()
print 'total number of questions = ' + str(question_counts.sum())
question_counts

total number of questions = 15616


Multiple Choice      5489
True or False        4288
Fill in the Blank    2532
Matching             1711
Short Answer         1596
dtype: int64
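
The overall and per-subject tallies can be collected in a single pass over the dataset. A sketch with a nested `Counter` follows, on toy data. The function name `tally_question_types` is hypothetical.

```python
from collections import Counter, defaultdict


def tally_question_types(dataset):
    """Return (per-subject counters, overall counter) of question types."""
    per_subject = defaultdict(Counter)
    for subject, flexbook in dataset.items():
        for lesson in flexbook.values():
            for question in lesson['questions']['nonDiagramQuestions'].values():
                per_subject[subject][question['type']] += 1
    # Adding the per-subject counters together gives the overall distribution.
    overall = sum(per_subject.values(), Counter())
    return per_subject, overall


# Toy dataset with three questions across two subjects.
toy_dataset = {
    'earth-science': {'lesson a': {'questions': {'nonDiagramQuestions': {
        'q01': {'type': 'Multiple Choice'},
        'q02': {'type': 'True or False'},
    }}}},
    'life-science': {'lesson b': {'questions': {'nonDiagramQuestions': {
        'q01': {'type': 'Multiple Choice'},
    }}}},
}

per_subject, overall = tally_question_types(toy_dataset)
print(overall.most_common())
```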

In [372]:
for subject, flexbook in ck12_combined_dataset.items():
    q_types = []
    for lesson in flexbook.values():
        for question in lesson['questions']['nonDiagramQuestions'].values():
            q_types.append(question['type'])
    question_counts = pd.Series(q_types).value_counts()
    print 'total number of ' + subject + ' questions = ' + str(question_counts.sum())
    print question_counts
    print 

total number of earth-science questions = 5373
Multiple Choice      2342
True or False        1702
Fill in the Blank     705
Matching              598
Short Answer           26
dtype: int64

total number of life-science questions = 5005
Multiple Choice      1927
True or False        1117
Short Answer          786
Fill in the Blank     601
Matching              574
dtype: int64

total number of physical-science questions = 5238
True or False        1469
Fill in the Blank    1226
Multiple Choice      1220
Short Answer          784
Matching              539
dtype: int64
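
As a sanity check, the per-subject counts printed above should add up to the overall distribution reported earlier. The sketch below copies the printed numbers by hand and recomputes the totals. The variable names are illustrative.

```python
# Counts copied from the per-subject output above.
reported = {
    'earth-science': {'Multiple Choice': 2342, 'True or False': 1702,
                      'Fill in the Blank': 705, 'Matching': 598, 'Short Answer': 26},
    'life-science': {'Multiple Choice': 1927, 'True or False': 1117,
                     'Short Answer': 786, 'Fill in the Blank': 601, 'Matching': 574},
    'physical-science': {'True or False': 1469, 'Fill in the Blank': 1226,
                         'Multiple Choice': 1220, 'Short Answer': 784, 'Matching': 539},
}

# Total per subject, total per question type, and grand total.
subject_totals = {subject: sum(counts.values()) for subject, counts in reported.items()}
type_totals = {}
for counts in reported.values():
    for q_type, n in counts.items():
        type_totals[q_type] = type_totals.get(q_type, 0) + n
grand_total = sum(subject_totals.values())

print(grand_total)
```

The recomputed per-type totals can be compared against the combined output (5489, 4288, 2532, 1711, 1596) and the grand total against 15616.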



## looking for missing values

In [414]:
for subject, flexbook in lessons_ds.items():
    q_len = []
    for lesson_name, lesson in flexbook.items():
        q_len.append(len(lesson['questions']['nonDiagramQuestions'].values()))
        if q_len[-1] == 7:  # print lessons with only 7 questions for inspection
            print subject, lesson_name
            pprint.pprint(lesson['questions']['nonDiagramQuestions'])
    q_lengths = pd.Series(q_len).value_counts()
    print 'total number of ' + subject + ' lessons = ' + str(q_lengths.sum())
    print q_lengths
    print 

total number of earth-science lessons = 271
10    267
9       2
12      1
11      1
dtype: int64

total number of life-science lessons = 271
10    270
9       1
dtype: int64

physical-science velocity
{u'q01': {u'answerChoices': {},
          u'beingAsked': {u'processedText': u'how fast an object is moving is its _____.',
                          u'rawText': u'1. How fast an object is moving is its _____.'},
          u'correctAnswer': {u'processedText': u'speed'},
          u'id': u'q01',
          u'idStructural': u'1.',
          u'type': u'Fill in the Blank'},
 u'q02': {u'answerChoices': {},
          u'beingAsked': {u'processedText': u'the measure of both speed and direction is _____.',
                          u'rawText': u'2. The measure of both speed and direction is _____.'},
          u'correctAnswer': {u'processedText': u'velocity'},
          u'id': u'q02',
          u'idStructural': u'2.',
          u'type': u'Fill in the Blank'},
 u'q03': {u'answerChoices': {},
        

The lessons with fewer questions seem to be genuine, i.e. their counts match the number of questions in the workbook.
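
Rather than hard-coding a count such as 7, the lessons worth inspecting can be found by comparing each lesson against the most common question count. A minimal sketch on toy data follows. `unusual_question_counts` and the lesson names are hypothetical.

```python
from collections import Counter


def unusual_question_counts(flexbook):
    """Return (typical count, {lesson: count}) for lessons that deviate from it."""
    counts = {name: len(lesson['questions']['nonDiagramQuestions'])
              for name, lesson in flexbook.items()}
    typical = Counter(counts.values()).most_common(1)[0][0]
    return typical, {name: n for name, n in counts.items() if n != typical}


def toy_lesson(n_questions):
    """Build a toy lesson holding n_questions empty question records."""
    questions = {'q%02d' % i: {} for i in range(1, n_questions + 1)}
    return {'questions': {'nonDiagramQuestions': questions}}


toy_flexbook = {
    'lesson a': toy_lesson(10),
    'lesson b': toy_lesson(10),
    'lesson c': toy_lesson(10),
    'lesson d': toy_lesson(7),
}

typical, outliers = unusual_question_counts(toy_flexbook)
print(typical, outliers)
```

Each flagged lesson can then be checked against the workbook by hand, as was done above.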

# html viz

In [313]:
from pdfextraction.lesson_viz import display_lesson_html as lesson_viz

In [339]:
subject = 'life-science' 
# lesson = '15.1 Understanding Animal Behavior'
random_lesson = np.random.choice(flexbook_ds[subject].keys(), 1)[0]
lesson_viz(flexbook_ds[subject], random_lesson)
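
The `display_lesson_html` helper, whose source appears in the `lesson_viz.py` cell of this notebook, orders a lesson's topics by `orderID` and renders them through a Jinja template. A `jinja2.Environment()` created without arguments does not autoescape, so characters such as `<` in lesson text reach the page verbatim. The sketch below shows the same ordering with explicit escaping. It swaps Jinja for plain string formatting and the standard library, and the `orderID` values and lesson text are made up for illustration.

```python
try:
    from html import escape  # Python 3
except ImportError:
    from cgi import escape   # Python 2


def ordered_topics(lesson_json):
    """Return (topic name, topic text) pairs sorted by each topic's orderID."""
    topics = sorted(lesson_json['topics'].items(), key=lambda kv: kv[1]['orderID'])
    return [(name, content['content']['text']) for name, content in topics]


def render_lesson(lesson_name, lesson_json):
    """Render a lesson as a small HTML fragment, escaping all text."""
    parts = ['<h1>Lesson: %s</h1>' % escape(lesson_name)]
    for name, text in ordered_topics(lesson_json):
        parts.append('<h3>%s</h3>' % escape(name))
        parts.append('<p>%s</p>' % escape(text))
    return '\n'.join(parts)


# Toy lesson; the orderID format here is an assumption.
toy_lesson = {'topics': {
    'Summary': {'orderID': 't_02', 'content': {'text': 'Speed < velocity in information.'}},
    'Introduction': {'orderID': 't_01', 'content': {'text': 'Motion basics.'}},
}}

page = render_lesson('velocity', toy_lesson)
print(page)
```

Inside a notebook the fragment could be wrapped in `IPython.display.HTML` for display.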

# End

In [24]:
import jinja2
from IPython.core.display import HTML

jnjenv = jinja2.Environment()

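%%writefile lesson_viz.py
# Imports added so the written module works on its own when imported.
import jinja2
from IPython.core.display import HTML

jnjenv = jinja2.Environment()

def make_lesson_data(lesson_json):
    # Collect (topic name, topic text) pairs, ordered by each topic's orderID.
    nested_text = []
    for topic, content in sorted(lesson_json['topics'].items(), key=lambda kv: kv[1]['orderID']):
        nested_text.append((topic, content['content']['text']))
    return nested_text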

def make_page_html(lesson_data, page_html):
    return jnjenv.from_string(page_html).render(lesson=lesson_data[0], topics=lesson_data[1])

def display_lesson_html(flexbook, lesson):
    lesson_json = flexbook[lesson]
    lesson_data = (lesson, make_lesson_data(lesson_json))
    lesson_html = make_page_html(lesson_data, page_html)
    return HTML(lesson_html)

page_html = """
<!DOCTYPE html>
<html>
  <head>
    <style type="text/css">
    </style>
  </head>
  <body>
    <div class="container">
      <h1>Lesson: {{lesson}}</h1>
      <ul>
        {% for topic in topics %}
        <p>
        </p>
        <h3>{{topic.0}}</h3>
        <p>{{
        topic.1
        }}</p>
        {% endfor %}
      </ul>
    </div>
    <script src="http://code.jquery.com/jquery-1.10.2.min.js"></script>
    <script src="http://netdna.bootstrapcdn.com/bootstrap/3.0.0/js/bootstrap.min.js"></script>
  </body>
</html>
"""

In [None]:

# stats_counter and stats_fract are not defined in this notebook;
# they are assumed to come from an earlier session.
# A fixed header list keeps the columns aligned with the cells written below.
headers = ['Number of Entities', 'Average Number per image']

html = "<table>"

# header row
html += "<tr><th>Entity Category</th>"
for header in headers:
    html += "<th>" + header + "</th>"
html += "</tr>"

# one row per entity category: raw count, then average per image
for k, v in stats_counter.items():
    html += "<tr><th>" + k + "</th>"
    html += "<td>" + str(v) + "</td>"
    html += "<td>" + "%.2f" % stats_fract[k] + "</td>"
    html += "</tr>"

html += "</table>"
HTML(html)

In [25]:
page_html = """
<html>
<head>
<title>{{ title }}</title>
</head>
<body>
Hello.
</body>
</html>
"""

In [18]:
page_template = """
<html>
 <head>
  <title>KB HIT</title>
  <meta content='text/html'/>
  <script type='text/javascript' src='https://s3.amazonaws.com/mturk-public/externalHIT_v1.js'></script>
 </head>
 <body>
    <p>We are constructing a large knowledge base (KB) about elementary science and commonsense knowledge, to help computers answer questions more reliably. We are planning to release the KB as a free, open source resource for the community when it is complete. Your work here will help us assemble this KB and contribute to this effort.</p>

    <p>Below, the computer has automatically extracted some candidate facts from text for possible inclusion in the KB. However, some are weird, false, or nonsensical. This task will help us distinguish the good facts, to include in the KB, from the bad.</p>
     <form name='mturk_form' method='post' id='mturk_form' action='https://workersandbox.mturk.com/mturk/externalSubmit'>
      <input type='hidden' value='' name='assignmentId' id='assignmentId'/>		 
      <table>
        <tr><th></th><th>Commonsense Knowledge</th></tr>
        {% for n in input_data %}
            <tr><td>{{n.sentence}}</td><td nowrap>
            <!--these break-->
            <!--<input type="hidden" name="{{n.sentence_id}}" id="assignmentId" value="ASSIGNMENT_NOT_AVAILABLE" />-->
            <!--<input type="hidden" name="assignmentId" id="assignmentId" value="ASSIGNMENT_NOT_AVAILABLE" />-->
            <!--this is in the official documentation but breaks anyway!-->
            <!--<input type='hidden' value='' name='assignmentId' id='assignmentId'/>-->
            <!--this works:-->
            <input name="{{n.sentence_id}}" type="radio" value="true-act" />EXPECTED ACTION
            <input name="{{n.sentence_id}}" type="radio" value="false-act" />RARE/FALSE ACTION
            <input name="{{n.sentence_id}}" type="radio" value="true-prop" />TRUE PROPERTY
            <input name="{{n.sentence_id}}" type="radio" value="false-prop" />RARE/FALSE PROPERTY
            <input name="{{n.sentence_id}}" type="radio" value="nonsense" />NONSENSE
            <input name="{{n.sentence_id}}" type="radio" value="unknown" />DON'T KNOW
            </td></tr>
        {% endfor %}
      </table>
      <p><input type="submit" id="submitButton" value="Submit" /></p>
   </form>
  <script language="Javascript">turkSetAssignmentID();</script>
 </body>
</html>

"""