# JEDM Reviewer Finder Database

To propose reviewers for new submissions we need a model of reviewer expertise. Our approach is to build this model from EDM and AIED publications using a vector space model.

Currently the following corpora are used:

- EDM 2011
- EDM 2014 to 2021
- AIED 2018, 2020, 2021

Years missing from this list were not available on the EDM website or EasyChair (i.e., the links were broken).

Extraction from PDFs uses [GROBID](https://github.com/kermitt2/grobid/) in a command-line batch process.

The resulting XML is then transformed into text suitable for our purposes.

Keyphrases are extracted from this text using [pke](https://github.com/boudinfl/pke) and then associated with vectors.

Search proceeds by looking for keywords in a submitted document and then returning all authors, etc., that match. Note that this is recall-oriented rather than precision-oriented and relies on the user's knowledge to ignore spurious keywords.
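As a rough illustration of this recall-oriented lookup, the sketch below matches stored keyphrases against a normalized submission and collects the authors of every document with at least one hit. It is not the notebook's search code: the function name `find_reviewers`, the two dictionaries, and the use of plain substring matching are all assumptions made for the example.

```python
import re


def normalize_text(text: str) -> str:
    """Collapse whitespace runs, trim, and lowercase the text."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_reviewers(submission, keyphrases_by_doc, authors_by_doc):
    """Return {author: sorted matched keyphrases} for a submitted text.

    keyphrases_by_doc: hypothetical dict, document id -> list of keyphrases
    authors_by_doc:    hypothetical dict, document id -> list of author names
    """
    text = normalize_text(submission)
    matches = {}
    for doc_id, phrases in keyphrases_by_doc.items():
        # Plain substring matching (an assumption): any hit counts,
        # which favors recall over precision.
        hits = [p for p in phrases if p in text]
        if not hits:
            continue
        for author in authors_by_doc.get(doc_id, []):
            matches.setdefault(author, set()).update(hits)
    return {author: sorted(hits) for author, hits in matches.items()}
```

Because a single shared keyphrase is enough to return an author, the user is expected to discard matches that rest on spurious keywords.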

The search code below is almost identical to what is used in the running web client.

## Get PDFs

The PDFs were downloaded using Firefox 62 with the DownloadStar extension: each proceedings page was visited manually, and DownloadStar automated the download of all PDFs on that page.

## PDF to XML

Using [GROBID command line](https://grobid.readthedocs.io/en/latest/Grobid-batch/)

In [None]:
# Run GROBID full-text extraction on every PDF in directory $1, writing the XML to directory $2
function doGrobid() {
    java -Xmx4G -jar /z/aolney/repos/grobid-0.5.1/grobid-core/build/libs/grobid-core-0.5.1-onejar.jar -gH /z/aolney/repos/grobid-0.5.1/grobid-home -dIn "$1" -dOut "$2" -exe processFullText
}

#These were performed on a previous run
# doGrobid /y/corpora/EDM-2011 /y/corpora/EDM-2011
# doGrobid /y/corpora/EDM-2014 /y/corpora/EDM-2014
# doGrobid /y/corpora/EDM-2015 /y/corpora/EDM-2015
# doGrobid /y/corpora/EDM-2016 /y/corpora/EDM-2016
# doGrobid /y/corpora/EDM-2017 /y/corpora/EDM-2017
# doGrobid /y/corpora/EDM/EDM-2018 /y/corpora/EDM/EDM-2018
# doGrobid /y/corpora/EDM/EDM-2019 /y/corpora/EDM/EDM-2019
# doGrobid /y/corpora/EDM/EDM-2020 /y/corpora/EDM/EDM-2020
# doGrobid /y/corpora/EDM/EDM-2021 /y/corpora/EDM/EDM-2021
# doGrobid /y/corpora/AIED/AIED2018 /y/corpora/AIED/AIED2018
doGrobid /y/corpora/AIED/AIED2020 /y/corpora/AIED/AIED2020
doGrobid /y/corpora/AIED/AIED2021 /y/corpora/AIED/AIED2021

## XML to usable text

In [2]:
open System.IO
let rec allFiles dirs =
    if Seq.isEmpty dirs then Seq.empty else
        seq { yield! dirs |> Seq.collect Directory.EnumerateFiles
              yield! dirs |> Seq.collect Directory.EnumerateDirectories |> allFiles }
let files = 
    [|"/y/corpora/EDM";"/y/corpora/AIED"|]
    |> allFiles
    |> Seq.filter( fun x -> x.EndsWith(".xml") )
                  
//files |> Seq.toArray


In [3]:
// check how many files are going in; not all will be extractable
files |> Seq.length

2718

In [6]:
type ReviewerInfo =
    {
        Name: string
        Order: int
        Hash : int
        //Some of these are really only useful for debugging
        Title : string
        Text : string
        File : string
    }
    
let GetName first last =
    match first,last with
    | Some(f), Some(l) -> f + " " + l
    | None, Some(l) -> l
    | Some(f), None -> f
    | None, None -> ""
    
//let xml = files |> Seq.head |> System.IO.File.ReadAllText

let spaceRegex = new System.Text.RegularExpressions.Regex(@"\s+");
let NormalizeText ( text : string ) =
    spaceRegex.Replace( text, " " ).Trim().ToLower()

let ExtractInfo xmlFile = 
    let doc = new System.Xml.XmlDocument();
    let xml = xmlFile |> System.IO.File.ReadAllText
    doc.LoadXml(xml);

    let nsmgr = new System.Xml.XmlNamespaceManager(doc.NameTable)
    nsmgr.AddNamespace("tei",  "http://www.tei-c.org/ns/1.0")

    let theTitle  = doc.SelectSingleNode(@"//tei:title", nsmgr).InnerText
    let theAbstract = doc.SelectSingleNode(@"//tei:abstract", nsmgr).InnerText
    let theText = 
        doc.SelectNodes(@"//tei:text//tei:p", nsmgr) 
        |> Seq.cast< System.Xml.XmlNode> 
        |> Seq.map( fun x -> 
                   let directChildren = x.ChildNodes |> Seq.cast< System.Xml.XmlNode> |> Seq.map (fun x -> x.Value) //unlike InnerText, ignores child descendent nodes
                   String.concat " " directChildren
                  )
        |> String.concat " "
        |> NormalizeText

    let authors = doc.SelectNodes(@"//tei:sourceDesc//tei:persName", nsmgr)
    let reviewerInfos =
        authors 
        |> Seq.cast< System.Xml.XmlNode> 
        |> Seq.mapi( fun i x -> 
                   let forename = 
                       match x.["forename"] with
                       | null -> None
                       | f -> Some(f.InnerText)
                   let surname = 
                       match x.["surname"] with
                       | null -> None
                       | s -> Some(s.InnerText)
                   //originally we hashed on the text, but the file name has the full path, so that is probably better
                   //{Name = (GetName forename surname); File = xmlFile; Text = theText; Order = i; Title = theTitle; Hash = (hash theText) }
                   {Name = (GetName forename surname); File = xmlFile; Text = theText; Order = i; Title = theTitle; Hash = (hash xmlFile) }
                  )
    //
    if reviewerInfos |> Seq.isEmpty then
        None
    else
        Some( reviewerInfos )
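For readers who do not use F#, the author-extraction step can be sketched in Python with the standard library. This is an illustration only, not part of the pipeline: `extract_authors` and `SAMPLE_TEI` are hypothetical names, and the sample is a hand-written fragment that imitates the GROBID TEI layout rather than real GROBID output.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI}

# Hand-written fragment imitating the layout of a GROBID TEI header (assumption).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc>
    <titleStmt><title>A Toy Paper</title></titleStmt>
    <sourceDesc><biblStruct><analytic>
      <author><persName><forename>Ada</forename><surname>Lovelace</surname></persName></author>
      <author><persName><surname>Turing</surname></persName></author>
    </analytic></biblStruct></sourceDesc>
  </fileDesc></teiHeader>
</TEI>"""


def extract_authors(tei_xml: str) -> list:
    """Return author names, in document order, from a TEI string."""
    root = ET.fromstring(tei_xml)
    names = []
    # Only persName elements under sourceDesc are the paper's own authors;
    # persName elements elsewhere (e.g. in the bibliography) are ignored.
    for source_desc in root.iter(f"{{{TEI}}}sourceDesc"):
        for pers in source_desc.iter(f"{{{TEI}}}persName"):
            forename = pers.find("tei:forename", TEI_NS)
            surname = pers.find("tei:surname", TEI_NS)
            parts = [n.text.strip() for n in (forename, surname)
                     if n is not None and n.text]
            names.append(" ".join(parts))
    return names
```

As in the F# `GetName` function, a missing forename or surname is simply omitted, so the second author in the sample is returned as `Turing`.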


In [5]:
let reviewerInfosWithHash =
    files
    |> Seq.choose ExtractInfo 
    |> Seq.collect id
    //|> Array.ofSeq

//for compression purposes, map hash to small integer
let hashHash =
    reviewerInfosWithHash
    |> Seq.map( fun ri -> ri.Hash )
    |> Seq.distinct
    |> Seq.mapi( fun i h -> h,i)
    |> Map.ofSeq

let reviewerInfos = 
    reviewerInfosWithHash
    |> Seq.map( fun ri -> { ri with Hash=hashHash.[ri.Hash]})
    
let authorOutput = 
    reviewerInfos
    |> Seq.map( fun ri ->  ri.Hash.ToString() + "\t" + ri.Name + "\t" + ri.Order.ToString()  )

//tab-separated layout read by the keyphrase extraction step: hash, text, title, file
let textOutput =
    reviewerInfos
    |> Seq.distinctBy( fun ri -> ri.Hash )
    |> Seq.map( fun ri -> ri.Hash.ToString() + "\t" + ri.Text + "\t" + ri.Title + "\t" + ri.File)
    

In [None]:
//This cell sometimes stalls even though all files are written; may need to restart kernel as result
System.IO.File.WriteAllLines( "authors.tsv", authorOutput )
System.IO.File.WriteAllLines( "texts.tsv", textOutput )
System.IO.Directory.CreateDirectory( "texts") |> ignore
for ri in reviewerInfos |> Seq.distinctBy( fun ri -> ri.Hash ) do
    let outputPath = System.IO.Path.Combine( "texts", ri.Hash.ToString() + ".txt" )
    System.IO.File.WriteAllText( outputPath, ri.Text )

## Get Keyphrases

We use [PKE KP-Miner](https://boudinfl.github.io/pke/build/html/unsupervised.html#kpminer) because it has [the best unsupervised performance on SemEval2010](http://aclweb.org/anthology/C16-2015).

### First we build a document frequency model based on our corpus.
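As a conceptual aid, the sketch below shows what a document frequency model records: for each n-gram, the number of documents in which it occurs at least once. It is a simplified stand-in, not pke's implementation; in particular it skips the stemming and stoplist filtering that the pke call applies, and the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, max_n):
    """Yield every n-gram of length 1..max_n as a space-joined string."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def document_frequency(docs, max_n=2):
    """Count, for each n-gram, how many documents contain it."""
    df = Counter()
    for doc in docs:
        # set() ensures an n-gram is counted once per document,
        # no matter how often it repeats inside that document.
        df.update(set(ngrams(doc.split(), max_n)))
    return df
```

N-grams that appear in most documents of the corpus receive high counts, which lets a keyphrase extractor down-weight them as uninformative for any single paper.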

In [1]:
#NLTK dependencies that must be installed first
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

True

In [2]:
from pke import compute_document_frequency
from string import punctuation

# path to the collection of documents
input_dir = '/z/aolney/repos/jedm-reviewer-finder/DataModel/texts/'

# path to the DF counts dictionary, saved as a gzip tab separated values
output_file = 'edm-df.gz'

# compute df counts and store stem -> weight values
compute_document_frequency(input_dir=input_dir,
                           output_file=output_file,
                           format="raw",            # input files format
                           use_lemmas=False,    # do not use Stanford lemmas
                           stemmer="porter",            # use porter stemmer
                           stoplist=list(punctuation),            # stoplist
                           delimiter='\t',            # tab separated output
                           extension='txt',          # input files extension
                           n=5)              # compute n-grams up to 5-grams


### Use KPMiner with our custom DF

In [3]:
import pke

output = []

#load df
df = pke.load_document_frequency_file(input_file='/z/aolney/repos/jedm-reviewer-finder/DataModel/edm-df.gz')


#must loop over each document in texts
with open('/z/aolney/repos/jedm-reviewer-finder/DataModel/texts.tsv') as inputFile:
    for line in inputFile:
        split = line.split("\t")
        text = split[1]
        hashCode = split[0]
        
        # 1. create a KPMiner extractor.
        extractor = pke.unsupervised.KPMiner(language='english')

        # 2. load the content of the document.
        #extractor.read_document(format='raw')
        extractor.read_text(text)

        # 3. select {1-5}-grams that do not contain punctuation marks or
        #    stopwords as keyphrase candidates. Here the least allowable seen
        #    frequency (lasf) is set to 3 and the number of words after which
        #    candidates are filtered out (cutoff) to 400; the trailing
        #    comments record the alternative values 5 and 200.
        lasf = 3 #5
        cutoff = 400 #200
        stoplist = nltk.corpus.stopwords.words("english")
        extractor.candidate_selection(lasf=lasf, cutoff=cutoff, stoplist=stoplist)

        # 4. weight the candidates using KPMiner weighting function.
        alpha = 2.3
        sigma = 3.0
        extractor.candidate_weighting(df=df, alpha=alpha, sigma=sigma)

        # 5. get the 10-highest scored candidates as keyphrases
        keyphrases = extractor.get_n_best(n=10)
        
        #NOTE: we convert score float to int for compression purposes
        for phrase,score in keyphrases:
            output.append( hashCode + "\t" + phrase + "\t" + str(int(score)) + "\n" )
        
#write out
with open("keys.tsv", "w") as f:
    f.writelines(output)
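The two outputs produced so far, `authors.tsv` (hash, name, author order) and `keys.tsv` (hash, keyphrase, integer score), can be joined into one keyphrase profile per author. The sketch below shows one plausible way to do this and is not taken from the notebook: the function name `build_author_profiles` is hypothetical, and summing scores across an author's papers is an assumption.

```python
from collections import defaultdict


def build_author_profiles(author_lines, key_lines):
    """Map each author name to {keyphrase: summed score}.

    author_lines: lines in the authors.tsv layout (hash, name, order)
    key_lines:    lines in the keys.tsv layout (hash, keyphrase, score)
    """
    authors_by_doc = defaultdict(list)
    for line in author_lines:
        doc_id, name, _order = line.rstrip("\n").split("\t")
        authors_by_doc[doc_id].append(name)

    profiles = defaultdict(lambda: defaultdict(int))
    for line in key_lines:
        doc_id, phrase, score = line.rstrip("\n").split("\t")
        # Every author of a document inherits that document's keyphrases.
        # Summing scores across papers is an assumption of this sketch.
        for name in authors_by_doc[doc_id]:
            profiles[name][phrase] += int(score)
    return {name: dict(phrases) for name, phrases in profiles.items()}
```

Each profile is effectively a sparse vector over keyphrases, so authors who publish repeatedly on a topic accumulate more weight for it.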

## Search models

Much of this section is still exploratory.

### Test document

We likely need to tell users specifically how much of the text to submit. It is unclear whether references should be excluded: their titles are likely to contain keywords but may not be representative of the text itself.

**References omitted from this test**

**Note: text normalization uses the same function as the original keyword creation**

In [4]:
let spaceRegex = new System.Text.RegularExpressions.Regex(@"\s+");
let NormalizeText ( text : string ) =
    spaceRegex.Replace( text, " " ).Trim().ToLower()
    
let pdfText = """The Simple Location Heuristic is Better at
Predicting Students’ Changes in Error Rate Over
Time Compared to the Simple Temporal Heuristic
A.F. NWAIGWE
AMERICAN UNIVERSITY OF NIGERIA, NIGERIA
AND
K.R. KOEDINGER
CARNEGIE MELLON UNIVERSITY, U.S.A
________________________________________________________________________
In a previous study on a physics dataset from the Andes tutor, we found that the simple location
heuristic was better at making error attribution than the simple temporal heuristic when evaluated
on the learning curve standard. In this study, we investigated the generality of performance of the
simple location heuristic and the simple temporal heuristic in the math domain to see if previous
results generalized to other Intelligent Tutoring System domains. In support of past results, we
found that the simple location heuristic provided a better goodness of fit to the learning curve
standard, that is, it was better at performing error attribution than the simple temporal heuristic.
One observation is that for tutors where the knowledge components can be determined by the
interface location in which an action appears, using the simple location heuristic is likely to show
better results than the simple temporal heuristic. It is possible that the simple temporal heuristic is
better in situations where the different problem subgoals can be associated with a single location.
However, our prior results with a physics data set indicated that even in such situations the simple
location heuristic may be better. Further research should explore this issue.
Key Words and Phrases: Error attribution methods, Intelligent Tutoring Systems, learning curves, mathematics
________________________________________________________________________
1. INTRODUCTION
Increasingly, learning curves have become a standard tool for evaluation of Intelligent
Tutoring Systems (ITS) [Anderson, Bellezza & Boyle, 1993; Corbett, Anderson, &
O’Brien, 1995; Koedinger & Mathan, 2004; Martin, Mitrovic, Mathan, & Koedinger,
2005; Mathan & Koedinger, 2005; Mitrovic & Ohlsson, 1990] and measurement of
students’ learning [Anderson, Bellezza & Boyle, 1993; Heathcote, Brown, & Mewhort,
2002]. The slope of learning curves show the rate at which a student learns over time, and
reveals how well the tutor’s cognitive model fits what the student is learning. However,
these learning curves require a method for attributing error to the “knowledge
components” (skills or concepts) in the student model that the student is missing.
Knowledge components, concepts and skills will be used interchangeably in this paper. In
a previous study using data from the Andes Intelligent tutor [VanLehn et al., 2005], four
__________________________________________________________________________________________
Authors’ addresses: A.F. Nwaigwe, School of Information Technology and Communication, American
University of Nigeria, Nigeria; E-mail : adaeze.nwaigwe@aun.edu.ng; K.R. Koedinger, Human Computer
Interaction Institute, Carnegie Mellon University, U.S.A.; E-mail : koedinger@cmu.edualternative heuristics were evaluated - simple location heuristic (LH), simple temporal
heuristic (TH), model-based location heuristic (MLH) and model-based temporal
heuristic (MTH) [Nwaigwe et al., 2007]. When evaluated on the learning curve standard,
the two location heuristics LH and MLH, outperformed the temporal heuristics, TH and
MTH. However, the generality of performance of these heuristics in other ITS subject
domains needs to be tested.
In this study conducted in the mathematics domain, we investigated whether the
previous performance of the LH and TH generalized to other ITS domains. We
specifically asked if the LH was better than the TH at predicting student changes in error
rate over time. We used log data from a Cognitive Tutor on a Scatterplot lesson and
implemented the learning curves standard using the statistical component of Learning
Factors Analysis [Cen, Koedinger & Junker, 2005; Pirolli & Wilson, 1998].
Our intuition is that the LH may be the better choice for error attribution when
knowledge components (KCs) can be determined by the interface location where an
action occurs. To justify this, imagine that a worker has homes, H a and H b in which to
perform tasks A and B respectively. The worker goes to home H a and attempts task A but
fails. The worker abandons the failed task A and goes to home H b , where he/she succeeds
at task B. The assumption is that tasks A and B are associated with different KCs. The
worker later returns to location H a , and this time, is successful at task A. The LH will
more rationally attribute the initial failed attempt at H a to the KC associated with task A
since its rule is to attribute error to the first successfully implemented KC at the initial
error location. The TH will however, wrongfully put blame on the KC associated with
task B since its method of error attribution is to blame the KC associated with the first
correctly implemented task.
Sometimes, TH might be a better choice for making error attribution. We believe this
to be the case when it is necessary to perform a set of tasks in a prescribed sequence. To
elaborate, imagine that homeschooler Bella is required to perform two tasks and in the
given sequence – eat breakfast (EB), and do schoolwork (DS) and in any of two
locations, L1 and L2 on the dining table of the family’s apartment. We again assume that
tasks EB and DS are associated with different KCs. Bella decides that she did not like
what Mom served for breakfast that morning and goes straight to her schoolwork, DS, at
location L1, skipping task EB. However, Bella fails at task DS due to hunger associated
distractions. Later, she abandons task DS and revisits and succeeds at task EB at location
L2. Bella then goes back location to L1 and completes task DS. In attributing blame, TH
will rationally blame the KC associated with task EB. However the LH will wrongfully
blame the KC associated with task, DS. These examples imply that it may be better to
apply heuristics in making error attribution.
Although an immediate purpose for error attribution is to drive learning curve
generation, the assignment of blame problem is more general and affects many aspects of
student modeling.
2. ERROR ATTRIBUTION HEURISTICS
A basic assumption of many cognitive models is that knowledge can be decomposed into
components, that each component is learned independently of the others and that
implementation of a step in the solution of a problem is an attempt to apply one or more
knowledge components (KCs). When correct solution steps are generated, either by an
expert system or a human expert, the step is often annotated with the KCs that must be
applied in order to generate the step. Thus, when a student enters that step, the system
can infer that the student is probably (but not necessarily) applying those KCs.An ITS system can be designed to anticipate and generate some incorrect steps and
associated goals, however, it is rare for expert systems or expert authors to anticipate and
generate a large number of incorrect steps and corresponding goals. Hence, when the
student enters an incorrect step, it is often not clear what KC(s) should have been applied,
so the system cannot determine which KC(s) the student is weak on. If the system simply
ignores incorrect steps, then it only “sees” successful applications of KCs. It cannot
“see” failures of a KC. It may see lots of incorrect steps, but it cannot determine and
record what KC(s) to blame for each error [VanLehn et al, 2005] and so, learning curves
cannot be generated. This suggests using heuristics.
The tutoring system usually has two clues available: the location of the incorrect step
on the user interface and the subsequent steps entered by the student. For instance, if a
student makes an error on a step at time 1 and at location A, the student will often attempt
to correct it immediately, perhaps with help from the tutor. So if the first correct step, at
time 2 is also at location A, and say, that the step is annotated with KC x, then it is likely
that the incorrect step at time 1 was a failed attempt to apply KC x. This heuristic allows
the system to attribute errors to KCs whenever the system sees a correct step immediately
following the target incorrect step, and both steps are in the same location on the user
interface.
However, it is not clear how to generalize this heuristic. What if the next correct step
is not in the same location? What if there are intervening incorrect steps in different
locations? In previous work using data from the Andes Physics Tutor, four automated
heuristics for making error attribution (LH, TH, MLH, MTH) were proposed and
evaluated guided by whether the heuristic was driven by location or by the temporal order
of events [Nwaigwe et al, 2007].
For every error transaction, LH attributes blame to the KC mapped to a subsequent
correct entry at the widget location where the error occurred [Anderson, Bellezza &
Boyle,1993; Koedinger & Mathan, 2004; Martin, Mitrovic, Mathan, & Koedinger, 2005]
while the TH ascribes blame to the KC that labels the first correct entry in time. When
there is no subsequent correct entry with a label of the error location, LH blames the KC
with the first correct entry in time, that is, it implements the behavior of TH. When the
tutor provides a choice of some KC to blame for an error, the MLH goes with the tutor’s
choice otherwise, it simply implements the LH. For an error transaction, MTH also goes
with the domain model’s choice if one exists, otherwise it implements the TH.
In this work, we examine the performance of the LH and TH in the mathematics
domain. Table I shows sample log transaction from the cognitive tutor for the scatterplot
lesson. The table illustrates how the LH and the TH can help resolve the error attribution
ambiguity. Columns in table 1 are described thus: “location” column indicates the place
on the interface (the interface widget) in which the student made an input; “Outcome”
indicates if an input is correct or not, while “Student Model KC” lists the system’s choice
of KC which the student should implement.
In row 1, the student makes an error at the location labeled, “var-0val-1”. The system
however does not indicate the KC the student ought to be practicing. To resolve this
ambiguity, the LH uses the KC that labels a subsequent correct entry in the same location
– see row 5. That is, it chooses “choose variable”. On the other hand, the TH chooses the
KC that labels the first correct entry in time, irrespective of interface location. Its choice
is “label x-axis”. In row #2, the domain model blames the KC “choose variable” for the
student’s error. LH chooses “choose variable” since it is the first correctly implemented
KC at the location “var-0val-1”. TH blames the KC “label x-axis” in this case.
In the prior study, the cognitive model generated by the LH was found to outperform
that of the TH and also, the tutor’s original model according to the learning curvestandard. In other words, the LH was better at making error attributions than the other
two cognitive models. Compared to the TH, we also found that the error attribution
method of the LH was more like that made by human coders. In this work, we conduct
our analysis in the math domain and compare the performance of the LH to that of the
TH based on the learning curve standard. Our goal is to see if the previous performance
of the LH and TH can be generalized to other intelligent tutoring system domains.
Table I. Table illustrating different error attributions made by the 2 methods
Outcome Student Model KC Error Attributions methods
KC
LH
TH
incorrect choose variable label x-axis
2
3 var-0val-1 incorrect
var-1val-1 correct choose variable choose variable
label x-axis
label x-axis label x-axis
label x-axis
4 var-0val-1 incorrect choose variable choose variable
5 var-0val-1 correct choose variable choose variable choose variable
#
1
Location
var-0val-1
3. LEARNING CURVES
Learning curves plot the performance of students with respect to some measure of their
ability over time [Anderson, Bellezza & Boyle, 1993; Corbett, Anderson, O’Brien, 1995;
Koedinger & Mathan, 2004; Martin, Mitrovic, Mathan, & Koedinger, 2005; Mathan &
Koedinger, 2005; Mitrovic & Ohlsson, 1990]. For ITSs, the standard approach is to
measure the proportion of knowledge components in the domain model that have been
“incorrectly” applied by the student. This is also known as the “error rate”. Other
alternatives exist, such as the number of attempts taken to correct a particular type of
error. Time is generally represented as the number of opportunities to practice a KC or
skill. This in turn may be determined in different ways: for instance, it may represent
each new step a student attempts that is relevant to the skill, on the basis that repeated
attempts at the KC are benefiting from the student having been given feedback and as-
needed instruction about that particular skill and hence may improve from one attempt to
the next. If the student is learning the KC or skill being measured, the learning curve will
follow a so-called “power law of practice” [Mathan & Koedinger, 2005]. If such a curve
exists, it presents evidence that the student is learning the skill being measured or
conversely, that the skill represents what the student is learning.
3.1. THE LEARNING CURVES STANDARD
The power law applies to individual skills and does not take into account student effects.
The statistical component of Learning Factors Analysis (LFA) extends the power law to a
logistic regression model which accommodates student effects for a cognitive model
incorporating multiple knowledge components and multiple students [Cen, Koedinger, &
Junker, 2005], see equation 1. The following are the assumptions on which equation 1 is
based:
1. Different students may know more or less initially. An intercept parameter of
this model reflects each student’s initial knowledge.2.
3.
4.
Students learn at the same rate. Thus, slope parameters do not depend on the
student. Slope parameters reflect the learning rate of each KC which the student
model encompasses and are independent of student effect. This assumption
made so as to reduce the number of parameters in equation 1 and is further
justified since equation 1 is focused on refining a cognitive model rather than on
evaluating students’ knowledge growth [Draney, Pirolli & Wilson, 1995].
Some KCs are more likely known than others. An intercept parameter for each
KC captures initial difficulty of the skill.
Since some KCs are easier to learn than other, the model of equation 1 uses a
slope parameter to reflect this for each skill. Larger values for initial difficulty
reflect tougher skills.
ln[p/(1-p)] = ! " i X i + ! # j Y j + ! $ j Y j T j .
(1)
where p – the probability of success at a step performed by student i that requires
knowledge component j; X i and Y j – the dummy variable vectors for students and
knowledge components respectively; T j – the number of practice opportunities student i
has had on knowledge component j; ! i – the coefficient that models student i’s initial
knowledge; " j – the coefficient that reflects the initial difficulty of knowledge component
j where larger values of initial difficulty reflect tougher skills; # j – the coefficient that
reflects the learning rate of knowledge component j, given its practice opportunity.
In this paper, the model of equation 1 is used to apply the learning curve standard.
Bayesian Information Criterion (BIC) [Wasserman, 2004] is used to estimate prediction
risk in the model while loglikelihood is used to measure model fit. Lower BIC scores,
mean a better balance between model fit and complexity.
4. DATA SOURCE
The data used for this research was collected as part of a study conducted in a set of 5
middle-school classrooms at 2 schools in the suburbs of a medium-sized city in the
Northeastern United States. Student ages ranged approximately from 12 to 14 years. The
classrooms studied were taking part in the development of a new 3-year cognitive tutor
curriculum for middle school mathematics [Baker, 2005; Baker., Corbett, Koedinger &
Wagner, 2004]. Data collected was from the study on these classrooms during the course
of a short (2 class periods) cognitive tutor unit on scatterplot generation and
interpretation. Scatterplots depict the relationship between two quantitative variables in a
Cartesian plane, using a point to represent paired values of each variable.
The scatterplot lesson consisted of a set of problems and for each problem, a student
was given a data set to generate a graph. The student then had to choose from a list, the
variables that were appropriate for use in the scatterplot (see figure 1); those that where
quantitative or categorical; and subsequently whether a chosen variable was appropriate
for a bar chart.
Next the student was required to label the X and Y-axis (see figure 2), and to choose
each axis bound and scale. The student was then required to plot points on the graph by
clicking on the desired position on the graph. Finally, the student was required to answer
a set of interpretation questions to reason about the graph’s trend, outliers, monotonicity,
and extrapolation and in comparison with other graphs. In our dataset, students solved a
maximum of six problems and a minimum of two in the scatterplot lesson.Figure 1 Scatterplot lesson interface for choosing variable type [Baker, 2005]
Figure 2 Interface for graph creation in the scatterplot lesson [Baker, 2005]5. METHODOLOGY
The algorithms for the LH and TH used in this research was implemented in pure java 1.6
and designed to process student log data in MS Excel format. Both algorithms used
sequential search. Log data from the cognitive tutor unit on scatterplot generation and
interpretation served as input to the programs. The output from each program was the
choice of KC codes made by the heuristic being implanted as explained in section 2.
To analyze the cognitive model of each heuristic according to the learning curve
standard, the data output from each program was then fit to equation 1 to derive learning
behavior. The coefficients of equation 1, initial KC difficulty ( " j ), initial student difficulty
( ! i ) and KC learning rate (# j ) were used to describe learning behavior for each heuristic. If
the intercept of a KC was higher, then, its initial difficulty was lower. Further, if the slope
of each KC was higher, then, the faster students learned that skill. For the model of each
heuristic, BIC score was used to estimate prediction risk while loglikelihood was used to
measure model fit.
6. RESULTS AND DISCUSSION
Table II summarizes the results of the learning curve standard for the student models for
both the LH and TH. The results show that the simple location heuristic, LH (BIC score:
7,510.12) shows better fit to the learning curve standard compared to the simple temporal
heuristic, TH (BIC scores: 7,703.58). This means that the model of the LH is more
reliable and so, a prediction error is more likely to occur if one used the TH model.
Loglikelihood score was also better for the LH (-3,370.37) than for the TH (-3,464.93),
indicating that the LH model was a better fit to the data than the competing TH model.
This shows how the different error attribution methods affect the result.
Table II. Results of the Learning Curve Standard
logLikelihood TH
-3,464.93 LH
-3,370.37
BIC 7,703.58 7,510.12
Learning Rate (! j ) Mean (Std) 0.09 (0.09) 0.133 (0.11)
Initial
KC
Difficulty ( " j )
Initial
Student
Difficulty ( # i )
# of KCs Mean (Std) -1.81 (0.94) 0.08 (1.10)
Mean (Std) 2.03 (0.61) -0.00 (0.63)
17 17
# of transactions across entire
scatterplot lesson 16,291 16,291
# of students 52 52Table III. Knowledge Component Details for the two Heuristics
Knowledge
Component (KC)
CHOOSE-VAR-TYPE-
CAT
MMS-VALUING-
DETERMINE-SET-
MAX
MMS-VALUING-
DETERMINE-SET-
MIN
QUANTITATIVE-
VALUING-FIRST-BIN
QUANTITATIVE-
VALUING-SECOND-
BIN
MMS-VALUING-
LABELSUSED
CHOOSE-VAR-TYPE-
NUM
MMS-VALUING-
DETERMINE-SCALE
MMS-VALUING-
LABELSUSED-PLUS2
TEST-SLOPE
CHOOSE-OVERALL-
REL
EXTRAPOLATE
CHOOSE-OK-BG
CHOOSE-X-AXIS-
QUANTITATIVE
CHOOSE-Y-AXIS-
QUANTITATIVE
MMS-VALUING-
DETERMINE-MIN
MMS-VALUING-
DETERMINE-RANGE
Simple Temporal Heuristic
(TH)
Ave
" j
! j
(Initial
Opp
(learning
difficulty) rate) Simple Location Heuristic
(LH)
Ave
" j
! j
(Initial
Opp
(learning
difficulty) rate)
6.6 -1.449 0.076 6.6
0.048
0.244
6.9 -0.793 0 6.2
1.275
0
6.3 -0.361 0.031 6.2
1.587
0.063
6.1 -2.642 0.159 5.5
-0.565
0.163
5.6 -0.947 0 5.4
0.879
0.052
6.9 -2.625 0.049 5.8
-0.942
0.219
18.2 -1.044 0.038 16.2
0.799
0.044
53.1 -0.069 0.007 50.2
2.364
0
5.9
3.3 -1.99
-2.213 0.063
0.131 5.8
3.3
-0.805
0.238
0.187
0
5.0
1.7
11.5 -3.175
-2.093
-1.708 0.257
0
0.198 5.2
1.5
11.4
-0.865
0.206
0.263
0.149
0
0.215
4.3 -2.572 0.274 3.2
-0.719
0.314
3.7 -2.618 0.018 3.2
-1.884
0.276
6.3 -3.053 0.119 5.8
-1.106
0.139
6.3 -1.357 0.147 6.1
0.584
0.196
Generally, we observed that, the LH performed better than the TH when the student
failed to successfully complete an attempted step and subsequently attempted and
succeeded at a different step. As shown in table I, the student unsuccessfully attempted a
step at location “var-0val-1” (trn # 1 & 2). The student subsequently went to location
“var-1val-1”, attempted and succeeded at the new step. While the TH incorrectly blamed
“label x-axis” which is the KC associated with the new step at location “var-1val-1”, the
LH more rationally blamed “choose variable” which is the KC that should be associated
with the step at location “var-0val-1”. Because the LH uses location for error attribution,
it correctly assigns blame to the KC associated with the error. TH however, wrongfully
blames the first subsequent KC that the student correctly attempts. Of the 16,291transactions in our dataset, error transactions recorded were 5,733. Of the latter, the LH
and TH differed on 1,583 (36%) transactions with respect to error attribution choices.
We also found that both the LH and the TH had the tendency to yield the same result
when the student succeeded at a step, even after multiple attempts, prior to attempting
and succeeding at a new step. This was the case 64% of the time.
In table III, average practice opportunity, initial KC difficulties and learning rates are
given for KCs and used to describe learning behavior for each heuristic. For example, for
the KC “CHOOSE-VAR-TYPE-CAT”, the learning rate (! j ) for the LH was more than 3
times that of the TH. Judging by KC initial difficulty (" j ), “CHOOSE-VAR-TYPE-CAT”
appeared more difficult for the model of the TH (-1.449) than for the model of the LH
(0.244). The average practice opportunity measured for that skill (6.6), was the same for
each heuristic. The latter means that on the average, each student had approximately 7
opportunities to practice the KC “CHOOSE-VAR-TYPE-CAT”.
From table III, for the most part, KC learning rate was higher for the skills in the
cognitive model of the LH compared to that for the TH. The trend for initial KC
difficulty was in the opposite direction as seen for KCs such as “MMS-VALUING-
DETERMINE-SET-MIN”,
“QUANTITATIVE-VALUING-SECOND-BIN”,
etc.
Generally, KCs in the cognitive model for TH appeared more difficult to students
initially, when compared to similar KCs in the cognitive model of the LH.
From table II, the mean learning rate for the LH was 0.133(+0.11) which evaluated
higher than that of the TH, 0.09(+0.09). The mean initial KC difficulty for the LH and
TH were 0.08(+1.1) and -1.84(+0.94) respectively. The reason for the latter seems to be
due to more errors being attributed to later opportunities in the TH than the LH. These
results thus illustrate the effects of error attribution.
7. CONCLUSION
In this paper, we investigated the generality of performance of two alternative methods
for making error attribution in intelligent tutoring systems - the simple location heuristic
and the simple temporal heuristic. Our study was carried out in the mathematics domain
using data from a cognitive tutor unit on scatterplot generation and interpretation. In
support of previous results obtained in the physics domain, we found that the simple
location heuristic was better at predicting students’ changes in error rate over time
compared to the simple temporal heuristic. This work shows that simpler, easier-to-
implement methods can be effective in the process of making error attribution.
One observation is that for tutors where the KCs can be determined by the interface
location (or widget) in which an action appears it is likely that the LH will show better
results than the TH. This feature is mostly true of the scatterplot tutor. It is possible that
the TH is better in situations where the different problem subgoals can be associated with
a single location. However, our prior results with a physics data set indicated that even in
such situations the LH may be better. Further research should explore this issue.
We also intend to investigate whether the use of the simple location-based heuristic
may improve on-line student modeling and associated future task selection. The
availability of datasets from the Pittsburgh Science of Learning Center’s ‘DataShop’ (see
http://learnlab.org) will facilitate the process of getting appropriate data.""" |> NormalizeText


## Model 1: Keyphrase Frequencies

Keyphrases that occur across many documents are probably more useful than those that occur only once.

In [None]:
let keyCounts = 
    "/z/aolney/repos/jedm-reviewer-finder/DataModel/keys.tsv"
    |> System.IO.File.ReadAllLines
    |> Seq.map( fun line -> line.Split("\t").[1])
    |> Seq.countBy id
    |> Seq.sortByDescending( fun (k,v) -> v)
    |> Seq.toArray
    
//System.IO.File.WriteAllLines( "keyFrequencies.tsv", keyCounts |> Seq.map( fun (k,v)-> k + "\t" + v.ToString()))

let keyMatches =
    keyCounts
    |> Seq.choose( fun (k,v) -> if pdfText.IndexOf(k) <> -1 then Some(k,v) else None )
    |> Seq.sortByDescending snd
    |> Seq.truncate 30 //return N most freq
    |> Seq.toArray
    
keyMatches

[|("et al", 104); ("course", 73); ("tutoring systems", 66); ("skills", 58);
  ("feedback", 48); ("tutoring", 41); ("math", 37); ("knowledge components", 37);
  ("student model", 36); ("intelligent tutoring", 35); ("kcs", 33); ("code", 30);
  ("skill", 28); ("student modeling", 28); ("words", 24); ("concepts", 23);
  ("tutoring system", 23); ("sequence", 22); ("actions", 22); ("user", 21);
  ("log data", 21); ("classroom", 21); ("word", 21); ("concept", 21);
  ("questions", 21); ("cognitive model", 18); ("problems", 16);
  ("parameters", 16); ("data set", 16); ("students learn", 16)|]

### RESULT

This looks plausible, though some of the matched keywords (e.g., "et al", "words", "user") are not very informative.

## Model 2: Keyphrase Frequencies Weighted by Score

We note that *each occurrence has its own score*. Therefore, instead of counting occurrences, we sum their scores.
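
To make the difference between counting occurrences and summing scores concrete, here is a minimal sketch on made-up data. The rows, hashes, and scores are hypothetical and only mimic the `hash<TAB>key<TAB>score` layout that the code in this notebook reads from `keys.tsv`; the `toy*` names are not part of the real pipeline.

```fsharp
// Hypothetical (hash, key, score) rows; NOT taken from the real keys.tsv
let toyRows =
    [ 1, "learning curve", 900
      2, "learning curve", 800
      1, "user", 50
      2, "user", 40
      3, "user", 30 ]

// Model 1 style: count how many rows mention each key
let toyCounts =
    toyRows
    |> Seq.countBy (fun (_, k, _) -> k)
    |> Seq.sortByDescending snd
    |> Seq.toList

// Model 2 style: sum the per-occurrence scores for each key
let toyScores =
    toyRows
    |> Seq.groupBy (fun (_, k, _) -> k)
    |> Seq.map (fun (k, rows) -> k, rows |> Seq.sumBy (fun (_, _, s) -> s))
    |> Seq.sortByDescending snd
    |> Seq.toList

printfn "%A" toyCounts   // "user" ranks first by count
printfn "%A" toyScores   // "learning curve" ranks first by summed score
```

In this toy case a frequent but low-scoring key wins under pure counts, while a less frequent but high-scoring key wins once scores are summed.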


In [5]:
#r "/z/aolney/repos/nuget-libraries/Newtonsoft.Json.9.0.1/lib/net40/Newtonsoft.Json.dll"

//Types for exploration
type HashKeyScore =
    {
        Hash : int
        Key : string
        Score : int
    }

type HashAuthorOrder =
    {
        Hash : int
        Author : string
        Order : int
    }

type HashTitleFile =
    {
        Hash : int
        Title : string
        File : string
    }
    
//Maps for exploration

let textIdToTextMap =
    "/z/aolney/repos/jedm-reviewer-finder/DataModel/texts.tsv"
    |> System.IO.File.ReadAllLines
    |> Seq.map( fun line ->
               let s = line.Split("\t")
               let htf = {Hash=System.Int32.Parse(s.[0]); Title= s.[2]; File=s.[3]}
               htf.Hash,htf
              )
    //|> Seq.groupBy( fun htf -> htf.Hash )
    |> Map.ofSeq
    
let keyToTextIdMap =
    "/z/aolney/repos/jedm-reviewer-finder/DataModel/keys.tsv"
    |> System.IO.File.ReadAllLines
    |> Seq.map( fun line ->
               let s = line.Split("\t")
               {Hash=System.Int32.Parse(s.[0]); Key= s.[1]; Score=System.Int32.Parse(s.[2])}
              )
    |> Seq.groupBy( fun hks -> hks.Key )
    |> Map.ofSeq
    
let textIdToAuthorMap = 
    "/z/aolney/repos/jedm-reviewer-finder/DataModel/authors.tsv"
    |> System.IO.File.ReadAllLines
    |> Seq.map( fun line ->
               let s = line.Split("\t")
               {Hash=System.Int32.Parse(s.[0]); Author= s.[1]; Order=System.Int32.Parse(s.[2])}
              )
    |> Seq.groupBy( fun hao -> hao.Hash )
    |> Map.ofSeq
    
//key score is the sum of scores across documents, not the count of docs key appears in
let keyScoreMap =
    keyToTextIdMap
    |> Seq.map (|KeyValue|)
    |> Seq.map( fun (k,hksList)-> k, hksList |> Seq.sumBy( fun hks -> hks.Score))
    |> Map.ofSeq

//--------------------------------------------------
// DEPLOYMENT

//Short types for compression
type IdScore =
    {
        I: int
        S: int
    }
type IdTitle =
    {
        I: int
        T: string
    }
type KeyScore =
    {
        K: string
        S: int
    }
type AuthorOrder =
    {
        A: string
        O: int
    }
type KeyIdScore =
    {
        K: string
        I: IdScore[]
    }
type IdAuthorOrder =
    {
        I: int
        A: AuthorOrder[]
    }
    
//Small maps for compression
//UPDATE: it seems the easiest way is to map to primitive JS objects and recombine later

let JSFunctionToFile( o : obj ) (name : string ) =
    let json = Newtonsoft.Json.JsonConvert.SerializeObject( o ) //, Newtonsoft.Json.Formatting.Indented )
    let myFunction = "module.exports = { " + name + ":function(){ \n return " + json + "; \n } };"
    System.IO.File.WriteAllText( name + ".js", myFunction)

let textIdToTextShort =
    textIdToTextMap
    |> Seq.map (|KeyValue|)
    |> Seq.sortBy( fun(h,_) -> h)
    |> Seq.map (fun (h,htf) -> htf.Title)
    |> Seq.toArray

let keyToTextIdShort =
    keyToTextIdMap
    |> Seq.map (|KeyValue|)
    |> Seq.map (fun (k,hksList) -> 
                let idScores = hksList |> Seq.map( fun hks -> {I=hks.Hash; S=hks.Score} ) |> Seq.toArray
                {K=k;I=idScores}
               )
    |> Seq.toArray

let textIdToAuthorShort = 
    textIdToAuthorMap
    |> Seq.map (|KeyValue|)
    |> Seq.map (fun (h,haoList) -> 
                let authorOrders =  haoList |> Seq.map( fun hao -> {A=hao.Author; O=hao.Order} ) |> Seq.toArray
                {I=h; A=authorOrders}
               )
    |> Seq.toArray

let keyTotalScoreShort =
    keyScoreMap
    |> Seq.map (|KeyValue|)
    |> Seq.map (fun (k,s) -> {K=k;S=s})
    |> Seq.toArray
    
//Format each map as a JS function
JSFunctionToFile textIdToTextShort "idTitle"
JSFunctionToFile keyToTextIdShort "keyIdScore"
JSFunctionToFile textIdToAuthorShort "idAuthorOrder"
JSFunctionToFile keyTotalScoreShort "keyTotalScore"

//---------------------------------------

//Evaluate on test document
let keyMatches =
    keyScoreMap
    |> Seq.map (|KeyValue|)
    |> Seq.choose( fun (k,v) -> if pdfText.IndexOf(k) <> -1 then Some(k,v) else None )
    |> Seq.sortByDescending snd
    |> Seq.truncate 30 //return the N highest-scoring keys
    |> Seq.toArray
    
keyMatches

[|("et al", 26237); ("student modeling", 18818); ("student model", 13868);
  ("learning curves", 12559); ("math", 12108); ("course", 11600);
  ("tutoring systems", 11519); ("kcs", 10995); ("knowledge components", 10801);
  ("skills", 10238); ("cognitive model", 9766); ("feedback", 8520);
  ("learning curve", 7469); ("subgoal", 7202); ("tutoring", 6874);
  ("data set", 6690); ("code", 6344); ("tutoring system", 6308);
  ("student models", 5690); ("log data", 5596); ("intelligent tutoring", 5516);
  ("concept", 5289); ("men", 5260); ("map", 4684); ("skill", 4656);
  ("concepts", 4337); ("rule", 4292); ("logistic regression", 4268);
  ("tas", 4226); ("lesson", 4083)|]

### RESULT

This looks better: it recovered "learning curve" and suppressed a number of "junk" words that appeared under pure frequency, such as `user`, `words`, `exam`, and `tasks`.

### Author suggestion using this model

For each matched keyphrase, get the hash ids of the texts it appears in and, from those, the associated authors. Weight each author by (i) the proportion of the keyphrase's total corpus score that their paper contributes and (ii) the author's position in the author list.
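
As a rough illustration of this weighting, the sketch below applies the same arithmetic to made-up numbers. The helper name `authorWeight` and the scores are hypothetical, and the sketch assumes author order is zero-based (first author has order 0), which is why 1 is added before dividing.

```fsharp
// Hypothetical helper mirroring the weighting described above:
// (paper's score for the key / key's total score in the corpus) / (author order + 1)
let authorWeight (paperScore: int) (totalScore: int) (order: int) =
    (float paperScore / float totalScore) / (float order + 1.0)

// Made-up numbers: one paper contributes 3000 of a keyphrase's 12000 total score
let firstAuthor  = authorWeight 3000 12000 0   // full share of the paper's proportion
let secondAuthor = authorWeight 3000 12000 1   // half of it
let thirdAuthor  = authorWeight 3000 12000 2   // a third of it

printfn "%f %f %f" firstAuthor secondAuthor thirdAuthor
```

Dividing by position is one simple way to favor lead authors; other discounting schemes would fit the same pipeline.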

In [6]:
keyMatches
//at this level we just aggregate information we might use
|> Seq.collect( fun (k,v)-> 
           keyToTextIdMap.[k] //hks list of all texts key appeared in
           |> Seq.collect( fun hks -> 
                      textIdToAuthorMap.[hks.Hash] //list of authors for that text
                      |> Seq.map( fun hao -> k, hao, hks.Score, v)
                      |> Seq.distinct
                     )
          )
//at this level we explore different metrics
//normalize keyword score for particular reviewer's paper by total for that keyword in corpus
//then normalize by the author's position in author order
|> Seq.map( fun (k, hao, score, total) -> 
           let htf = textIdToTextMap.[hao.Hash]
           k, hao.Author, (float(score)/float(total)) / (float(hao.Order)+1.0), htf.File, htf.Title ) 
//at this level we group on keyword and sort matches descending by score
|> Seq.groupBy(fun (k,a,s,f,t) -> k )
|> Seq.map( fun (k,v) -> k, v |> Seq.sortByDescending( fun (k,a,s,f,t) -> s ) )
//|> Seq.sortByDescending(fun (k, a,s,f,t) -> k,s )
//|> Seq.truncate 20
|> Seq.toArray

[|("et al",
   seq
     [("et al", "Yiqiao Xu", 0.03117734497,
       "/y/corpora/EDM/EDM-2018/EDM2018_paper_64.tei.xml",
       "How many friends can you make in a week?: evolving social relationships in MOOCs over time");
      ("et al", "Yiqiao Xu", 0.03117734497,
       "/y/corpora/EDM/EDM-2018/wVFgXNCts2/EDM2018_paper_64.xml",
       "How many friends can you make in a week?: evolving social relationships in MOOCs over time");
      ("et al", "Oluwabukola Mayowa Ishola", 0.02077219194,
       "/y/corpora/EDM/EDM-2017/paper_75.tei.xml",
       "Predicting Prospective Peer Helpers to Provide Just-In-Time Help to Users in Question and Answer Forums");
      ("et al", "Oluwabukola Mayowa Ishola", 0.02077219194,
       "/y/corpora/EDM/EDM-2017/S3dBUBVNxJ/paper_75.xml",
       "Predicting Prospective Peer Helpers to Provide Just-In-Time Help to Users in Question and Answer Forums");
      ...]);
  ("student modeling",
   seq
     [("student modeling", "Kenneth Holstein", 0.1595281114,
 

### RESULT

This looks very reasonable. Based on this, we:

- Optimize the JSON as much as seems reasonable (minify)
- Write a Fable app that consumes the JSON
- [Fable publish](https://github.com/fable-compiler/static-page-generator) to [gh-pages](https://help.github.com/articles/configuring-a-publishing-source-for-github-pages/) (which uses gzip compression)

If the string search is too slow in Fable, we have the option of using [Aho-Corasick string matching](https://github.com/tombooth/aho-corasick.js). However, local testing suggests this is not a problem.