Commit a910224

added ACL paper, minor updates

burrsettles committed Jun 10, 2016
1 parent 0dd1342

Showing 3 changed files with 52 additions and 19 deletions.
56 changes: 39 additions & 17 deletions README.md
@@ -6,7 +6,7 @@ Half-life regression (HLR) is a model for spaced repetition practice, with parti

This repository contains a public release of the data and code used for several experiments in the following paper (which introduces HLR):

-> B. Settles and B. Meeder. 2016. A Trainable Spaced Repetition Model for Language Learning.
+> B. Settles and B. Meeder. 2016. [A Trainable Spaced Repetition Model for Language Learning](settles.acl16.pdf).
> In _Proceedings of the Association for Computational Linguistics (ACL)_, to appear.
When using this data set and/or software, please cite this publication. A BibTeX record is:
@@ -24,36 +24,58 @@ When using this data set and/or software, please cite this publication. A BibTeX

## Software

-The file ``experiment.py`` contains a Python implementation of half-life regression, as well as several baseline spaced repetition algorithms used in Section 4.1 of the paper above. It implemented in pure Python, and we recommend using [pypy](http://pypy.org/) on large datasets for efficiency. The software creates the subfolder ``results/`` which contains model predictions on the test set and induced model weights for inspection.
+The file ``experiment.py`` contains a Python implementation of half-life regression, as well as several baseline spaced repetition algorithms used in Section 4.1 of the paper above. It is implemented in pure Python, and we recommend using [pypy](http://pypy.org/) on large data sets for efficiency. The software creates the subfolder ``results/`` for outputting model predictions on the test partition and induced model weights for inspection.
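
For orientation, HLR models recall with a log-linear half-life: the estimated half-life is ``h = 2^(theta . x)`` (in days, where ``x`` is the feature vector) and the predicted recall probability after a lag of ``delta`` days is ``p = 2^(-delta/h)``. A minimal sketch of the prediction step (the weights and features below are hypothetical, not the induced model):

```python
import math

# hypothetical weights; the trained model learns many more (e.g., one per lexeme)
weights = {'bias': 0.5, 'right': 0.2, 'wrong': -0.3}

def hlr_predict(feature_vector, delta_days):
    # estimated half-life in days: h = 2 ** (theta . x)
    h_hat = 2. ** sum(weights.get(name, 0.) * value for name, value in feature_vector)
    # predicted recall probability: p = 2 ** (-delta / h)
    p_hat = 2. ** (-delta_days / h_hat)
    return p_hat, h_hat

# e.g., 5 prior correct and 2 prior incorrect recalls, practiced again 3 days later
# (the paper uses square roots of the history counts as features)
fv = [('bias', 1.), ('right', math.sqrt(5.)), ('wrong', math.sqrt(2.))]
print(hlr_predict(fv, 3.0))
```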

-The file ``evaluation.r`` contains an R function, ``sr_evaluate()``, which takes a prediction file from the script above, and implements the three key metrics we use for evaluation: mean absolute error (MAE), area under the ROC curve (AUC), and Spearman correlation for estimated half-life. Significance tests are also included.
+The file ``evaluation.r`` implements an R function, ``sr_evaluate()``, which takes a prediction file from the script above and implements the three metrics we use for evaluation: mean absolute error (MAE), area under the ROC curve (AUC), and Spearman correlation for estimated half-life. Significance tests are also included.
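
For readers who prefer Python to R, a rough sketch of the same three metrics (this is not ``sr_evaluate()`` itself, significance tests are omitted, and it assumes the prediction file has a header row with the column names used by ``experiment.py`` below; pandas, scipy, and scikit-learn are assumptions of this sketch, not dependencies of the repository):

```python
import pandas as pd
from scipy.stats import spearmanr
from sklearn.metrics import mean_absolute_error, roc_auc_score

def evaluate(pred_file):
    df = pd.read_csv(pred_file, sep='\t')  # columns: p, pp, h, hh, ...
    mae = mean_absolute_error(df['p'], df['pp'])
    # one plausible binarization of observed recall rates for the ROC curve
    auc = roc_auc_score((df['p'] > 0.5).astype(int), df['pp'])
    rho, _ = spearmanr(df['h'], df['hh'])
    return mae, auc, rho

print(evaluate('results/example.preds'))  # hypothetical output path
```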


-## Dataset and Format
+## Data Set and Format

-The dataset is available here: [settles.acl16.learning_traces.13m.csv.gz](https://s3.amazonaws.com/duolingo-papers/publications/settles.acl16.learning_traces.13m.csv.gz) (361 MB).
+The data set is available here: [settles.acl16.learning_traces.13m.csv.gz](https://s3.amazonaws.com/duolingo-papers/publications/settles.acl16.learning_traces.13m.csv.gz) (361 MB). This is a gzipped CSV file containing the 13 million Duolingo student learning traces used in our experiments.

-This is a gzipped CSV file containing the 13 million Duolingo student learning traces used in our experiments. The columns are as follows:
+The columns are as follows:

* ``p_recall`` - proportion of exercises from this lesson/practice where the word/lexeme was correctly recalled
-* ``timestamp`` - UNIX timestamp of the current lesson/practice session
-* ``delta`` - time (in seconds) since the last lesson/practice session that included this word
-* ``user_id`` - student user ID (anonymized)
+* ``timestamp`` - UNIX timestamp of the current lesson/practice
+* ``delta`` - time (in seconds) since the last lesson/practice that included this word/lexeme
+* ``user_id`` - student user ID who did the lesson/practice (anonymized)
* ``learning_language`` - language being learned
* ``ui_language`` - user interface language (presumably native to the student)
-* ``lexeme_id`` - system ID for the lexeme tag (i.e., word/concept)
+* ``lexeme_id`` - system ID for the lexeme tag (i.e., word)
* ``lexeme_string`` - lexeme tag (see below)
-* ``history_seen`` - total times user has seen the lexeme tag prior to this session
-* ``history_correct`` - total times user has been correct for the lexeme tag prior to this session
-* ``session_seen`` - times the user saw the lexeme tag during this session
-* ``session_correct`` - times the user got the lexeme tag correct during this session
+* ``history_seen`` - total times user has seen the word/lexeme prior to this lesson/practice
+* ``history_correct`` - total times user has been correct for the word/lexeme prior to this lesson/practice
+* ``session_seen`` - times the user saw the word/lexeme during this lesson/practice
+* ``session_correct`` - times the user got the word/lexeme correct during this lesson/practice
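
A minimal loading sketch using the columns above (pandas is an assumption of this sketch, not a dependency of the repository):

```python
import pandas as pd

# read a sample of the gzipped CSV (the full file has ~13M rows)
traces = pd.read_csv('settles.acl16.learning_traces.13m.csv.gz', nrows=100000)

# lag since the last lesson/practice that included this word, in days
traces['delta_days'] = traces['delta'] / (60. * 60. * 24.)

# the student's accuracy on this word prior to the current session
traces['history_acc'] = traces['history_correct'] / traces['history_seen']

print(traces[['p_recall', 'delta_days', 'history_acc']].describe())
```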

-The ``lexeme_string`` column contains the "lexeme tag" used by the Duolingo system for each data instance at the time of these experiments. It has been added for this release to facilitate future research and analysis. Only the ``lexeme_id`` column was used in our original experiments. The ``lexeme_string`` field uses the following format:
+The ``lexeme_string`` column contains a string representation of the "lexeme tag" used by Duolingo for each lesson/practice (data instance) in our experiments. It has been added for this release to facilitate future research and analysis. Only the ``lexeme_id`` column was used in our original experiments. The ``lexeme_string`` field uses the following format:

```
surface-form/lemma<pos>[<modifiers>...]
```

-Where ``surface-form`` refers to the inflected form seen in (or intended for) the exercise, ``lemma`` is the uninflected root, ``pos`` is the high-level part of speech, and each of the ``modifiers`` encodes a morphological component specific to the surface form (tense, gender, person, case, etc.).
+Where ``surface-form`` refers to the inflected form seen in (or intended for) the exercise, ``lemma`` is the uninflected root, ``pos`` is the high-level part of speech, and each of the ``modifiers`` encodes a morphological component specific to the surface form (tense, gender, person, case, etc.). A few examples from Spanish:

-Some tags contain wildcard components, written ``<*...>``. For example, ``<*sf>`` refers to a "generic" lexeme without any specific surface form (e.g., a lexeme tag that represents _all_ conjugations of a verb: "run," "ran," "running," etc.), or ``<*numb>`` (e.g., both singular and plural forms of a noun: "teacher" and "teachers"). The file ``lexeme_reference.txt`` contains a reference of pos and modifier components used for lexeme tags.
+```
+bajo/bajo<pr>
+blancos/blanco<adj><m><pl>
+carta/carta<n><f><sg>
+de/de<pr>
+diario/diario<n><m><sg>
+ellos/prpers<prn><tn><p3><m><pl>
+es/ser<vbser><pri><p3><sg>
+escribe/escribir<vblex><pri><p3><sg>
+escribimos/escribir<vblex><pri><p1><pl>
+lee/leer<vblex><pri><p3><sg>
+lees/leer<vblex><pri><p2><sg>
+leo/leer<vblex><pri><p1><sg>
+libro/libro<n><m><sg>
+negra/negro<adj><f><sg>
+persona/persona<n><f><sg>
+por/por<pr>
+son/ser<vbser><pri><p3><pl>
+soy/ser<vbser><pri><p1><sg>
+y/y<cnjcoo>
+```

+Some tags contain wildcard components, written as ``<*...>``. For example, ``<*sf>`` refers to a "generic" lexeme without any specific surface form (e.g., a lexeme tag that represents _all_ conjugations of a verb: "run," "ran," "running," etc.). The ``<*numb>`` modifier subsumes both singular and plural forms of a noun (e.g., "teacher" and "teachers"). The file ``lexeme_reference.txt`` contains a reference of pos and modifier components used for lexeme tags.
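
A best-effort parser for this tag format (an illustrative sketch, not an official grammar; wildcard components such as ``<*sf>`` simply come through as ordinary modifiers):

```python
import re

TAG = re.compile(r'^(?P<surface>[^/]+)/(?P<lemma>[^<]+)(?P<components>(<[^>]+>)*)$')

def parse_lexeme(tag):
    # split 'surface-form/lemma<pos><modifiers>...' into its parts
    m = TAG.match(tag)
    if not m:
        return None
    components = re.findall(r'<([^>]+)>', m.group('components'))
    pos, modifiers = (components[0], components[1:]) if components else (None, [])
    return m.group('surface'), m.group('lemma'), pos, modifiers

print(parse_lexeme('escribimos/escribir<vblex><pri><p1><pl>'))
# -> ('escribimos', 'escribir', 'vblex', ['pri', 'p1', 'pl'])
```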
15 changes: 13 additions & 2 deletions experiment.py
@@ -23,7 +23,7 @@


# data instance object
-Instance = namedtuple('Instance', 'p t fv h a lang right wrong ts uid'.split())
+Instance = namedtuple('Instance', 'p t fv h a lang right wrong ts uid lexeme'.split())


class SpacedRepetitionModel(object):
@@ -161,6 +161,16 @@ def dump_predictions(self, fname, testset):
                pp, hh = self.predict(inst)
                f.write('%.4f\t%.4f\t%.4f\t%.4f\t%s\t%s\t%d\n' % (inst.p, pp, inst.h, hh, inst.lang, inst.uid, inst.ts))

+    def dump_detailed_predictions(self, fname, testset):
+        with open(fname, 'wb') as f:
+            f.write('p\tpp\th\thh\tlang\tuser_id\ttimestamp\tlexeme_tag\n')
+            for inst in testset:
+                pp, hh = self.predict(inst)
+                for i in range(inst.right):
+                    f.write('1.0\t%.4f\t%.4f\t%.4f\t%s\t%s\t%d\t%s\n' % (pp, inst.h, hh, inst.lang, inst.uid, inst.ts, inst.lexeme))
+                for i in range(inst.wrong):
+                    f.write('0.0\t%.4f\t%.4f\t%.4f\t%s\t%s\t%d\t%s\n' % (pp, inst.h, hh, inst.lang, inst.uid, inst.ts, inst.lexeme))


def pclip(p):
    # bound min/max model predictions (helps with loss optimization)
@@ -240,7 +250,7 @@ def read_data(input_file, method, omit_bias=False, omit_lexemes=False, max_lines
        fv.append((intern('bias'), 1.))
        if not omit_lexemes:
            fv.append((intern('%s:%s' % (row['learning_language'], lexeme_string)), 1.))
-        instances.append(Instance(p, t, fv, h, (right+2.)/(seen+4.), lang, right_this, wrong_this, timestamp, user_id))
+        instances.append(Instance(p, t, fv, h, (right+2.)/(seen+4.), lang, right_this, wrong_this, timestamp, user_id, lexeme_string))
        if i % 1000000 == 0:
            sys.stderr.write('%d...' % i)
    sys.stderr.write('done!\n')
@@ -291,3 +301,4 @@ def read_data(input_file, method, omit_bias=False, omit_lexemes=False, max_lines
        os.makedirs('results/')
    model.dump_weights('results/'+filebase+'.weights')
    model.dump_predictions('results/'+filebase+'.preds', testset)
+    # model.dump_detailed_predictions('results/'+filebase+'.detailed', testset)
Binary file added settles.acl16.pdf
