Commit a910224

added ACL paper, minor updates

burrsettles committed Jun 10, 2016
1 parent 0dd1342

Showing 3 changed files with 52 additions and 19 deletions.
56 changes: 39 additions & 17 deletions README.md
@@ -6,7 +6,7 @@ Half-life regression (HLR) is a model for spaced repetition practice, with parti

This repository contains a public release of the data and code used for several experiments in the following paper (which introduces HLR):

-> B. Settles and B. Meeder. 2016. A Trainable Spaced Repetition Model for Language Learning.
+> B. Settles and B. Meeder. 2016. [A Trainable Spaced Repetition Model for Language Learning](settles.acl16.pdf).
> In _Proceedings of the Association for Computational Linguistics (ACL)_, to appear.
When using this data set and/or software, please cite this publication. A BibTeX record is:
@@ -24,36 +24,58 @@ When using this data set and/or software, please cite this publication. A BibTeX

## Software

-The file ``experiment.py`` contains a Python implementation of half-life regression, as well as several baseline spaced repetition algorithms used in Section 4.1 of the paper above. It implemented in pure Python, and we recommend using [pypy](http://pypy.org/) on large datasets for efficiency. The software creates the subfolder ``results/`` which contains model predictions on the test set and induced model weights for inspection.
+The file ``experiment.py`` contains a Python implementation of half-life regression, as well as several baseline spaced repetition algorithms used in Section 4.1 of the paper above. It is implemented in pure Python, and we recommend using [pypy](http://pypy.org/) on large data sets for efficiency. The software creates the subfolder ``results/`` for outputting model predictions on the test partition and induced model weights for inspection.
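
For orientation, HLR models recall with a log-linear half-life: the estimated half-life is ``h = 2^(theta . x)`` (in days, where ``x`` is the feature vector) and the predicted recall probability after a lag of ``delta`` days is ``p = 2^(-delta/h)``. A minimal sketch of the prediction step (the weights and features below are hypothetical, not the induced model):

```python
import math

# hypothetical weights; the trained model learns many more (e.g., one per lexeme)
weights = {'bias': 0.5, 'right': 0.2, 'wrong': -0.3}

def hlr_predict(feature_vector, delta_days):
    # estimated half-life in days: h = 2 ** (theta . x)
    h_hat = 2. ** sum(weights.get(name, 0.) * value for name, value in feature_vector)
    # predicted recall probability: p = 2 ** (-delta / h)
    p_hat = 2. ** (-delta_days / h_hat)
    return p_hat, h_hat

# e.g., 5 prior correct and 2 prior incorrect recalls, practiced again 3 days later
# (the paper uses square roots of the history counts as features)
fv = [('bias', 1.), ('right', math.sqrt(5.)), ('wrong', math.sqrt(2.))]
print(hlr_predict(fv, 3.0))
```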

-The file ``evaluation.r`` contains an R function, ``sr_evaluate()``, which takes a prediction file from the script above, and implements the three key metrics we use for evaluation: mean absolute error (MAE), area under the ROC curve (AUC), and Spearman correlation for estimated half-life. Significance tests are also included.
+The file ``evaluation.r`` implements an R function, ``sr_evaluate()``, which takes a prediction file from the script above and implements the three metrics we use for evaluation: mean absolute error (MAE), area under the ROC curve (AUC), and Spearman correlation for estimated half-life. Significance tests are also included.
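
For readers who prefer Python to R, a rough sketch of the same three metrics (this is not ``sr_evaluate()`` itself, significance tests are omitted, and it assumes the prediction file has a header row with the column names used by ``experiment.py`` below; pandas, scipy, and scikit-learn are assumptions of this sketch, not dependencies of the repository):

```python
import pandas as pd
from scipy.stats import spearmanr
from sklearn.metrics import mean_absolute_error, roc_auc_score

def evaluate(pred_file):
    df = pd.read_csv(pred_file, sep='\t')  # columns: p, pp, h, hh, ...
    mae = mean_absolute_error(df['p'], df['pp'])
    # one plausible binarization of observed recall rates for the ROC curve
    auc = roc_auc_score((df['p'] > 0.5).astype(int), df['pp'])
    rho, _ = spearmanr(df['h'], df['hh'])
    return mae, auc, rho

print(evaluate('results/example.preds'))  # hypothetical output path
```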


-## Dataset and Format
+## Data Set and Format

-The dataset is available here: [settles.acl16.learning_traces.13m.csv.gz](https://s3.amazonaws.com/duolingo-papers/publications/settles.acl16.learning_traces.13m.csv.gz) (361 MB).
+The data set is available here: [settles.acl16.learning_traces.13m.csv.gz](https://s3.amazonaws.com/duolingo-papers/publications/settles.acl16.learning_traces.13m.csv.gz) (361 MB). This is a gzipped CSV file containing the 13 million Duolingo student learning traces used in our experiments.

-This is a gzipped CSV file containing the 13 million Duolingo student learning traces used in our experiments. The columns are as follows:
+The columns are as follows:

* ``p_recall`` - proportion of exercises from this lesson/practice where the word/lexeme was correctly recalled
-* ``timestamp`` - UNIX timestamp of the current lesson/practice session
-* ``delta`` - time (in seconds) since the last lesson/practice session that included this word
-* ``user_id`` - student user ID (anonymized)
+* ``timestamp`` - UNIX timestamp of the current lesson/practice
+* ``delta`` - time (in seconds) since the last lesson/practice that included this word/lexeme
+* ``user_id`` - student user ID who did the lesson/practice (anonymized)
* ``learning_language`` - language being learned
* ``ui_language`` - user interface language (presumably native to the student)
-* ``lexeme_id`` - system ID for the lexeme tag (i.e., word/concept)
+* ``lexeme_id`` - system ID for the lexeme tag (i.e., word)
* ``lexeme_string`` - lexeme tag (see below)
-* ``history_seen`` - total times user has seen the lexeme tag prior to this session
-* ``history_correct`` - total times user has been correct for the lexeme tag prior to this session
-* ``session_seen`` - times the user saw the lexeme tag during this session
-* ``session_correct`` - times the user got the lexeme tag correct during this session
+* ``history_seen`` - total times user has seen the word/lexeme prior to this lesson/practice
+* ``history_correct`` - total times user has been correct for the word/lexeme prior to this lesson/practice
+* ``session_seen`` - times the user saw the word/lexeme during this lesson/practice
+* ``session_correct`` - times the user got the word/lexeme correct during this lesson/practice
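
A minimal loading sketch using the columns above (pandas is an assumption of this sketch, not a dependency of the repository):

```python
import pandas as pd

# read a sample of the gzipped CSV (the full file has ~13M rows)
traces = pd.read_csv('settles.acl16.learning_traces.13m.csv.gz', nrows=100000)

# lag since the last lesson/practice that included this word, in days
traces['delta_days'] = traces['delta'] / (60. * 60. * 24.)

# the student's accuracy on this word prior to the current session
traces['history_acc'] = traces['history_correct'] / traces['history_seen']

print(traces[['p_recall', 'delta_days', 'history_acc']].describe())
```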

-The ``lexeme_string`` column contains the "lexeme tag" used by the Duolingo system for each data instance at the time of these experiments. It has been added for this release to facilitate future research and analysis. Only the ``lexeme_id`` column was used in our original experiments. The ``lexeme_string`` field uses the following format:
+The ``lexeme_string`` column contains a string representation of the "lexeme tag" used by Duolingo for each lesson/practice (data instance) in our experiments. It has been added for this release to facilitate future research and analysis. Only the ``lexeme_id`` column was used in our original experiments. The ``lexeme_string`` field uses the following format:

```
surface-form/lemma<pos>[<modifiers>...]
```

-Where ``surface-form`` refers to the inflected form seen in (or intended for) the exercise, ``lemma`` is the uninflected root, ``pos`` is the high-level part of speech, and each of the ``modifiers`` encodes a morphological component specific to the surface form (tense, gender, person, case, etc.).
+Where ``surface-form`` refers to the inflected form seen in (or intended for) the exercise, ``lemma`` is the uninflected root, ``pos`` is the high-level part of speech, and each of the ``modifiers`` encodes a morphological component specific to the surface form (tense, gender, person, case, etc.). A few examples from Spanish:

-Some tags contain wildcard components, written ``<*...>``. For example, ``<*sf>`` refers to a "generic" lexeme without any specific surface form (e.g., a lexeme tag that represents _all_ conjugations of a verb: "run," "ran," "running," etc.), or ``<*numb>`` (e.g., both singular and plural forms of a noun: "teacher" and "teachers"). The file ``lexeme_reference.txt`` contains a reference of pos and modifier components used for lexeme tags.
+```
+bajo/bajo<pr>
+blancos/blanco<adj><m><pl>
+carta/carta<n><f><sg>
+de/de<pr>
+diario/diario<n><m><sg>
+ellos/prpers<prn><tn><p3><m><pl>
+es/ser<vbser><pri><p3><sg>
+escribe/escribir<vblex><pri><p3><sg>
+escribimos/escribir<vblex><pri><p1><pl>
+lee/leer<vblex><pri><p3><sg>
+lees/leer<vblex><pri><p2><sg>
+leo/leer<vblex><pri><p1><sg>
+libro/libro<n><m><sg>
+negra/negro<adj><f><sg>
+persona/persona<n><f><sg>
+por/por<pr>
+son/ser<vbser><pri><p3><pl>
+soy/ser<vbser><pri><p1><sg>
+y/y<cnjcoo>
+```

+Some tags contain wildcard components, written as ``<*...>``. For example, ``<*sf>`` refers to a "generic" lexeme without any specific surface form (e.g., a lexeme tag that represents _all_ conjugations of a verb: "run," "ran," "running," etc.). The ``<*numb>`` modifier subsumes both singular and plural forms of a noun (e.g., "teacher" and "teachers"). The file ``lexeme_reference.txt`` contains a reference of pos and modifier components used for lexeme tags.
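
A best-effort parser for this tag format (an illustrative sketch, not an official grammar; wildcard components such as ``<*sf>`` simply come through as ordinary modifiers):

```python
import re

TAG = re.compile(r'^(?P<surface>[^/]+)/(?P<lemma>[^<]+)(?P<components>(<[^>]+>)*)$')

def parse_lexeme(tag):
    # split 'surface-form/lemma<pos><modifiers>...' into its parts
    m = TAG.match(tag)
    if not m:
        return None
    components = re.findall(r'<([^>]+)>', m.group('components'))
    pos, modifiers = (components[0], components[1:]) if components else (None, [])
    return m.group('surface'), m.group('lemma'), pos, modifiers

print(parse_lexeme('escribimos/escribir<vblex><pri><p1><pl>'))
# -> ('escribimos', 'escribir', 'vblex', ['pri', 'p1', 'pl'])
```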
15 changes: 13 additions & 2 deletions experiment.py
@@ -23,7 +23,7 @@


# data instance object
-Instance = namedtuple('Instance', 'p t fv h a lang right wrong ts uid'.split())
+Instance = namedtuple('Instance', 'p t fv h a lang right wrong ts uid lexeme'.split())


class SpacedRepetitionModel(object):
@@ -161,6 +161,16 @@ def dump_predictions(self, fname, testset):
                pp, hh = self.predict(inst)
                f.write('%.4f\t%.4f\t%.4f\t%.4f\t%s\t%s\t%d\n' % (inst.p, pp, inst.h, hh, inst.lang, inst.uid, inst.ts))

+    def dump_detailed_predictions(self, fname, testset):
+        with open(fname, 'wb') as f:
+            f.write('p\tpp\th\thh\tlang\tuser_id\ttimestamp\tlexeme_tag\n')
+            for inst in testset:
+                pp, hh = self.predict(inst)
+                for i in range(inst.right):
+                    f.write('1.0\t%.4f\t%.4f\t%.4f\t%s\t%s\t%d\t%s\n' % (pp, inst.h, hh, inst.lang, inst.uid, inst.ts, inst.lexeme))
+                for i in range(inst.wrong):
+                    f.write('0.0\t%.4f\t%.4f\t%.4f\t%s\t%s\t%d\t%s\n' % (pp, inst.h, hh, inst.lang, inst.uid, inst.ts, inst.lexeme))


def pclip(p):
    # bound min/max model predictions (helps with loss optimization)
@@ -240,7 +250,7 @@ def read_data(input_file, method, omit_bias=False, omit_lexemes=False, max_lines
        fv.append((intern('bias'), 1.))
        if not omit_lexemes:
            fv.append((intern('%s:%s' % (row['learning_language'], lexeme_string)), 1.))
-        instances.append(Instance(p, t, fv, h, (right+2.)/(seen+4.), lang, right_this, wrong_this, timestamp, user_id))
+        instances.append(Instance(p, t, fv, h, (right+2.)/(seen+4.), lang, right_this, wrong_this, timestamp, user_id, lexeme_string))
        if i % 1000000 == 0:
            sys.stderr.write('%d...' % i)
    sys.stderr.write('done!\n')
@@ -291,3 +301,4 @@ def read_data(input_file, method, omit_bias=False, omit_lexemes=False, max_lines
        os.makedirs('results/')
    model.dump_weights('results/'+filebase+'.weights')
    model.dump_predictions('results/'+filebase+'.preds', testset)
+    # model.dump_detailed_predictions('results/'+filebase+'.detailed', testset)
Binary file added settles.acl16.pdf
