<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="content-type">
<title>ANDES Log Processing</title>
</head>
<body>
<h1>Processing ANDES Logs from OLI</h1>
<br>
This document explains how to obtain Andes log sets from OLI, anonymize
them, and convert them into datashop format. All scripts for dealing
with log files can be found in the LogProcessing directory of the Andes
tree.<br>
<br>
<h2>Obtaining Logs from OLI</h2>
OLI logs can be retrieved from the OLI QA server,
oli-qa.andrew.cmu.edu, using OLI's <span style="font-style: italic;">Data
Extraction tool.</span> OLI will have to give you an account and
configure your access to the Data Extraction tool, after which a link
for it will show up on the top left of your start page when
you log in to OLI.<br>
<br>
Logs on the QA server are mirrors of those on the OLI production
server. These mirrors are made periodically, typically when QA software
is updated and at other irregular intervals. So, they typically lag the
data on production by a few weeks. Retrieving logs from the production
server is not allowed, since the process can overload the server. <br>
<br>
For courses hosted on the PSLC server, the Data Extraction tool may be
made available on the live server at any time.<br>
<br>
<h2>Running the Data Extraction tool</h2>
After clicking the Data Extraction tool, you are walked through a
series of screens to define a query. You should make the following
selections:<br>
<br>
Step 1: Select scope of data<br>
Go down to "Course Sections" and select the
course sections you want. Generally, download one section at
a time, so that section information is not lost in the DataShop.
The available ones for you should be
highlighted. OLI may have to configure your access to particular course
logs.
Otherwise you will see them but not be able to select them in the Data
Extraction tool interface. <br>
Do not select anything else on this page. You do not
normally want to select by content package, for example.<br>
Click Go to Step 2<br>
<br>
Step 2: Options<br>
<span
style="font-weight: bold;">Columns</span>: This selects which columns
from the OLI log database will be included in the output in a
comma-separated format. The essential one is the "info" column. The
info column records application-specific information. In our
case, there is a single row in the OLI log database corresponding to
the event of Andes uploading a log file at the end of an Andes problem
session. This row includes the entire text of an Andes session logfile
in the "info" field, so info should always remain checked. <br>
One should also check the "Action
TimeStamp" to receive the server's time at the time the log was
made. This is useful to have because the date and time shown<span
style="font-style: italic;"> inside</span> the Andes logs is derived
from the user's system clock, which may be incorrectly set. The
OLI-recorded event time is therefore more reliable. This time should be very
close to the time of the <span style="font-style: italic;">end </span>of
the Andes session, when the log file was uploaded. <br>
One may check the Action
Time Zone for a more complete specification about the time.
However, since this event is recorded by the OLI server, this
does not really add information -- the times here should always be in
the Eastern time zone where the OLI servers are, even if the students
are working in different time zones. <br>
All other columns should
be unchecked to keep the output simple.<br>
<br>
Notes: In theory
there might be some use for the User Id column, which will show an
anonymized OLI user id (in the form of a long GUID), or the Session GUID,
which identifies an OLI login session. This detail could be used to
correlate Andes logs with other OLI, non-ANDES log entries made by the
same student or within the same OLI login session. For example, one can
also use the Data Extraction tool to retrieve logs showing access to
learning pages in the course. With that result, one could determine
which learning pages the student visited in the same session as the
Andes log was made. However, we have never made any use of this
information up to now. <br>
<br>
The question has sometimes come up as to whether we
can determine when students view training videos on OLI from their
logs. Right now, there is no record made in the OLI log database of
these events, because the videos are simply served as web content that does
not pass through the OLI courseware system. One way around this would be
to wrap each video in its own learning page. In this case there <span
style="font-style: italic;">would </span>be a log of those learning
page accesses. However, that would not show whether the student had
launched the video or not, merely whether they had opened the page on
which it resides. Another possible way might be to convert the videos
to Flash format, and make use of Flash media logging support built into
certain OLI tools. This would log media viewing events in the
DataShop's XML notation, and these events could then be retrieved.<br>
<br>
<span
style="font-weight: bold;">Actions</span>: Select Andes. This is a
custom action which indicates uploading an Andes log at the end of an
Andes problem session.<br>
<br>
<span
style="font-weight: bold;">Filters</span>: leave blank<br>
<br>
<span
style="font-weight: bold;">Options</span>: Normally one should
select a date range which includes all logs in the course but excludes
other semesters that might contain the same student. The dates do
not have to be exact. For example, for a Spring term 2007
course, it could suffice to select a range from 2007-01-01 to
2007-06-01, even if the exact course date range was somewhat narrower. <br>
<br>
The reason for this is
that OLI does not actually record course information in the log
database. Rather, when you request logs for a course, the Data
Extraction server looks up the course roster and translates this into a
query for logs by all students on that roster. If a student was in both
a fall term and spring term course, this will retrieve logs from
outside of the course. Using a restricted date range will limit this to
student logs from the time the course was running.<br>
<br>
All other options should
be left as defaults.<br>
<br>
Note: if a student did
work under their own OLI id in other courses than the desired one, for
example, in the open and free course, these logs will also be included
in the set. It is not easy to eliminate these since there is no record
in the OLI database of the course from which a log was saved. However,
the activity GUIDs which identify particular problems are specific to a
course: problem KT1A in one course will be identified by a different
activity GUID than KT1A in a different course. So in theory one could
correlate activity GUIDs with those used in the course to identify the
particular course. However, activity GUIDs are not currently recorded
when Andes logs are saved. Andes would have to be modified to stamp the
activity GUID into the log itself for this. <br>
<br>
At Step 3, the query is submitted and you should start downloading a
zip file named data.zip. However, this step often times out or fails if
the server is heavily
loaded. In that case you just have to try again at a different time.<br>
<br>
The data.zip itself contains a single file named
actions.csv, with all the data. Save this zip file with some
identifying name like Fall2007-USNA.dat.z <br>
<h4>Concatenated log format</h4>
The embedded file you get is formally a csv file
showing selected rows in a database, like a
spreadsheet. The first line gives the column headings and subsequent
lines give comma-separated values. As noted above, the
last "column" of each line is the full Andes log, which normally
includes multiple text lines within it. Effectively that makes
this file a concatenated sequence of Andes logs, with a
little bit of cruft before the initial log header line. <br>
<br>
For most purposes the concatenated file can be
processed as a unit. By using zcat and piping output to further
processing, it may not even be necessary to uncompress it. However, it
could also be split into separate log files for processing
-- see the splitlogs.pl tool in the Andes LogProcessing directory. <br>
<br>
Within the concatenated log, individual session logs
can be found by searching for the Andes log header line, which will
have the following format:<br>
<br>
2008/04/03 18:15:21,Eastern,# Log of Andes session begun Thursday,
April 03, 2008 18:14:50 by [user] on [computer]<br>
<br>
Individual logs should each end with an END-LOG line<br>
1:00 END-LOG <br>
However, in theory an error could prevent the
END-LOG statement from getting into the log, so it is safest just to
end a log at the header of the next one.<br>
<br>
Technically this file is in Unix text file
format, in which there is a single newline character at the end of
each row from the database. However, the last column of every row
after the first (header) line ends with the full text of an Andes log
file, and an Andes log file is itself a multi-line Windows-format
text file in which the lines end with carriage-return linefeed pairs.
One effect of this is that when viewed in a text editor, there will
appear to be a blank line after each log file. <br>
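<br>
As a rough illustration of ending each log at the header of the next one, here is a minimal Python sketch. It is not the splitlogs.pl script. The <code>HEADER</code> pattern is an assumption based on the sample header line shown above, and <code>split_sessions</code> is a hypothetical helper name.<br>
<pre>
```python
import re

# Assumed header pattern, modeled on the sample header line shown above.
HEADER = re.compile(r"# Log of Andes session begun .* by (.+) on ")


def split_sessions(text):
    """Split concatenated log text into (user, lines) pairs, one per session.

    A session starts at each header line and runs until the next header,
    so a log with no END-LOG line is still separated from its successor.
    Anything before the first header (the column headings) is dropped.
    """
    sessions = []
    # splitlines() treats both "\n" and "\r\n" as line endings, which suits
    # the mix of Unix rows and Windows-format log lines in this file.
    for line in text.splitlines():
        match = HEADER.search(line)
        if match:
            sessions.append((match.group(1), [line]))
        elif sessions:
            sessions[-1][1].append(line)
    return sessions
```
</pre>
Each returned pair holds the user id from the header and the lines of that session, header included. The empty line caused by the mixed line endings stays at the end of each session and can be discarded by the caller.<br>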
<h2><span style="font-weight: bold;">Anonymizing the logs</span></h2>
<br>
There are different scripts for anonymizing USNA logs, where our
students are supposed to use their alphas, aka mid numbers, as user
ids, and anonymizing arbitrary student logs, where the ids may take any
form.<br>
<h5>Anonymizing USNA midshipmen logs: mkanon-usna.pl<br>
</h5>
To anonymize mid logs, obtain the file mkanon-usna.pl.template. Edit it
to include a different hash function in the munge_id routine to map the
mid number into some anonymization code. Rename it to mkanon-usna.pl.
The basic way to anonymize is then:<br>
<br>
zcat Fall2007-USNA.dat.z |
perl mkanon-usna.pl > Fall2007-USNA-anon.dat 2> idmap.txt<br>
<br>
The standard output will be the anonymized concatenated log
(uncompressed). The stderr output will contain any error messages, of
which the most important are reports of ids that do not conform to the
mid number pattern. This will be followed by a tab-delimited listing of
the id mapping generated. This mapping should be saved for reference
and for use in applying the mapping to other files. So you should
inspect idmap.txt first, and edit out any error messages before saving it.
[Maybe better to just write idmap.txt file separately from error
messages? ]<br>
<br>
In this script, any ids that do not have the pattern of a mid number
will be passed through unchanged. Normally such ids are limited to
instructor or TA accounts. So, these untranslated ids should then be
identifiable in the anonymized output by the fact that they differ in
form from all the student ids. <br>
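<br>
The general shape of a <code>munge_id</code> routine can be sketched in Python as follows. This is only an illustration: the real routine lives in the Perl template, the salted SHA-256 hash is a stand-in chosen for this sketch and is not the hash used for any existing dataset, and the six-digit mid number pattern and the <code>SECRET</code> value are assumptions.<br>
<pre>
```python
import hashlib
import re

# Assumption: a mid number is six digits, optionally prefixed with "m".
MID_PATTERN = re.compile(r"^m?(\d{6})$", re.IGNORECASE)

# Hypothetical private salt; keep the real value out of version control.
SECRET = "replace-with-a-private-string"


def munge_id(user_id):
    """Return an anonymized code for a mid number, or the id unchanged."""
    match = MID_PATTERN.match(user_id)
    if not match:
        # Ids that do not look like mid numbers pass through unchanged,
        # e.g. instructor or TA accounts.
        return user_id
    # Hash only the digits, so "m091234" and "091234" get the same code.
    digits = match.group(1)
    digest = hashlib.sha256((SECRET + digits).encode("ascii")).hexdigest()
    # Truncating keeps codes short but makes collisions possible in
    # principle; check the generated map for duplicate codes.
    return digest[:8].upper()
```
</pre>
Because the code depends only on the digits, ids with and without the initial "m" merge automatically, as described for the USNA script.<br>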
<p>The user id appears twice in each session: in cases where there
is some ambiguity, it can be useful to compare the two:
<pre>
cat log/Spring2009-USNA.dat | perl -ne 'm/# Log .* by (.+) on / and print "$1 "; m/read-student-info "(.*)"/ and print "$1\n"' | sort | uniq
</pre>
<h5>Using OLI class rosters</h5>
It is generally useful to save copies of the OLI roster tables. OLI
does not provide any method to export the rosters, so the method for
this is just to view the roster page in a browser, select every line in
the roster table display, copy the selection to the clipboard, then
paste it into a text file. Save the file with a name of the form
SECTIONID-roster.txt, e.g. "treacy-3346-roster.txt" or
"gershman-honors-roster.txt" for use with the DataShop conversion
process. <br>
<br>
Note: Some versions of the roster display contain check boxes in the
left-most column. These can safely be included in the selection: when
pasted into a text file, they will simply create an empty column, which
is harmless. However, if you try pasting directly into
Excel, the
check boxes and other formatting can cause trouble. That is why it is
easiest to go via a text file.<br>
<br>
The text file may be usable as is as a space-delimited file. However,
it is more reliable to convert it to a tab-delimited format, because
some of the fields might contain spaces. For example, one student used
a full name with a space as an OLI user id. Also, the first or
last name column might contain a space if a student uses a name like
"Van Helsing".<br>
<br>
One way to convert the text file rosters into a tab-delimited table is by
copying the text from the file and pasting it into Excel. Because these
files are delimited by spaces, Excel will not automatically tabulate
this data, but will place it all in the leftmost column. While this
pasted data is still selected, select "Tools/Text to Columns" to
convert this into spreadsheet data, selecting spaces and commas as the
delimiters. Excel will now display the data in columns. This can then
be saved in a tab-delimited format. You should eyeball the data
and fix up any rows that contain extra columns because of extra spaces
in the source.<br>
<br>
One thing to watch out for when passing ids through Excel is that by
default Excel will strip initial 0's when it writes out apparently
numeric data. Mid numbers for some years begin with 0's, so this can be
a problem, although mid numbers for future years will not. (Mid numbers
begin with the two-digit graduation year of the class. Almost all students
are sophomores, so in Fall 08 most student ids will begin with 11.)<br>
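<br>
If leading zeros have already been lost, they can be restored afterwards. The following is a minimal Python sketch, assuming mid numbers are six digits long (as in the example id m091234 used elsewhere in this document). <code>restore_mid</code> is a hypothetical helper, not part of the LogProcessing scripts.<br>
<pre>
```python
def restore_mid(value, width=6):
    """Pad a numeric id back to `width` digits, keeping an initial "m".

    Values that are not numeric (after an optional "m") are returned
    unchanged. The default width of 6 is an assumption.
    """
    text = value.strip()
    prefix = ""
    if text[:1] in ("m", "M"):
        prefix, text = text[0], text[1:]
    if text.isdigit():
        return prefix + text.zfill(width)
    return value
```
</pre>
For example, an id written out by Excel as 91234 comes back as 091234, while a non-numeric id such as "maverick" is left alone.<br>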
<h5>Merging or fixing student ids</h5>
A nuisance is that sometimes the same real student will have created
two or more different OLI ids. Also, a student may not have used a mid
number at all as an id. Also, instructors may have enrolled in the
course under special ids which might need to be treated specially. <br>
<br>
Normally we ask instructors to grant us TA
access to courses we
support so we can inspect the gradebook and rosters. So the usual way
to identify such students is to inspect the OLI section rosters, which
will show students with the same real name on different lines. <br>
<br>
Note the anonymization for USNA mid ids will already merge students who
used two user
ids consisting of a mid number with and without an initial "m" since it
anonymizes based on the mid number itself, regardless of prefix. So no
special fixup is
required for these cases.<br>
<br>
Both these issues can be handled by creating a file named merge-ids.txt
in the directory in which the script is run. This should be a
tab-separated two-column listing mapping login ids to canonical
user ids: instances of the first id will be mapped to the second before
anonymizing. The second id need never have been actually used, e.g. for
someone who used a non-mid number id, you could just map it to their
mid
number, even if they never logged in under that number. <br>
<br>
When merging ids into a mid number, it doesn't matter if the initial
"m" is included or not. By convention I have been omitting the
initial m's here on the idea that the canonical id is the number.<br>
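<br>
The merge step described above amounts to a dictionary lookup applied before anonymizing. A minimal Python sketch, assuming the two-column tab-separated layout described for merge-ids.txt; the function names are hypothetical and this is not the Perl implementation.<br>
<pre>
```python
def parse_merge_lines(lines):
    """Build a dict of login id -> canonical id from tab-separated lines."""
    merge = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        # Skip blank or malformed lines instead of failing on them.
        if len(fields) == 2 and fields[0]:
            merge[fields[0]] = fields[1]
    return merge


def canonical_id(user_id, merge):
    """Return the canonical id for a login id; unlisted ids are unchanged."""
    return merge.get(user_id, user_id)
```
</pre>
In use, the lines would come from the file itself, for example by passing the handle returned by <code>open("merge-ids.txt")</code> to <code>parse_merge_lines</code>, and <code>canonical_id</code> would be applied to each login id before it is anonymized.<br>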
<h6>Applying the id map to other files: mapid-usna.pl<br>
</h6>
The mapping defined in the saved file idmap.txt can be applied to other
text files, e.g. a list of questionnaire respondents or students in some
experimental condition, by using the script mapid-usna.pl.<br>
<br>
mapid-usna.pl < infile > outfile<br>
<br>
This script looks for a file named idmap.txt in the current directory.
It should also be customized from the template file to include the
same mapping function used in mkanon-usna. It will apply the map to
mid numbers in the input. If an id is not found in the map, it will use
the hashing algorithm to generate one, and will issue an error message. <br>
<br>
The
advantage of re-using the log-generated mapping file is that it
gives a message when it encounters a user id for which no logs were
found in
the set. This may indicate an anomaly requiring investigation, e.g. a
questionnaire respondent who entered his user id incorrectly.<br>
<br>
The file merge-ids.txt will also be applied to the text being
processed. This can lead to spurious reports of ids not in the
map which may be ignored. For example, suppose the id "maverick" was
used by a mid whose alpha is m091234. Then merge-ids.txt will contain a
line mapping "maverick" to "091234" (with or without the m), so scripts
will treat "maverick" as if it were "091234". The anonymization map
idmap.txt will then wind up with an entry mapping "maverick" via this
intermediary to some hexadecimal number, say "DC954". However, if you
anonymize some other table containing the true mid number "091234" via
the idmap, it will report that there is no entry for that. This is
harmless because application of the hashing algorithm will come up with
the same anonymous id, namely DC954.<br>
<h5>Anonymizing arbitrary logs</h5>
Student ids in non-USNA datasets can be anonymized with the mkanon.pl
and mapid.pl scripts. These don't use a hashing algorithm. Rather, they
simply generate ids by adding numbers to a specified prefix. To use
these, create a tab-separated mapping file (of any name)
initialized to contain a special mapping for PREFIX, e.g.<br>
<br>
PREFIX\tWH081<br>
<br>
Then do<br>
<br>
perl mkanon.pl
mapfile.txt < log > newlog<br>
<br>
where mapfile.txt is the mapping file. This will update mapfile.txt
with any new mappings. <br>
<br>
This script is designed to be run and re-run multiple times while both
reading and updating the mapping file. In each case, an existing
mapping will
be used if found in the mapfile. Otherwise the mapping file will be
updated with any new
entries at the end. <br>
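<br>
The reuse-or-extend behavior described above can be sketched in Python. This is an illustration only: the exact form of the generated ids in mkanon.pl is not documented here, so the "PREFIX-NNN" layout, the zero padding, and the function names are all assumptions.<br>
<pre>
```python
def load_map(lines):
    """Read tab-separated "id, anonymous id" lines into a dict.

    The special key PREFIX holds the prefix for newly generated ids.
    """
    mapping = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) == 2:
            mapping[fields[0]] = fields[1]
    return mapping


def anonymous_id(user_id, mapping):
    """Return the existing anonymous id, or assign the next numbered one."""
    if user_id not in mapping:
        # Every entry except PREFIX itself is an already-assigned student.
        assigned = len(mapping) - 1
        # Assumed id layout: prefix, hyphen, three-digit counter.
        mapping[user_id] = "%s-%03d" % (mapping["PREFIX"], assigned + 1)
    return mapping[user_id]
```
</pre>
Running the same ids through <code>anonymous_id</code> again returns the ids already assigned, which is what makes repeated runs against the same mapping file safe. Writing the updated dictionary back to the mapping file is left out of this sketch.<br>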
<h5>Applying the id map to other files: mapid.pl</h5>
It is harder to anonymize arbitrary text because with arbitrary ids,
there is no easy way to tell which piece of the text is a student id --
though it is unusual, arbitrary student ids can even contain spaces.
However, the most common text needing anonymization is a spreadsheet of
information containing student ids in one column. The script
mapidcol.pl can anonymize one column of a spreadsheet given in a tab
delimited format. Usage is<br>
<br>
perl mapidcol.pl mapfile.txt column-number < old
> new<br>
<br>
where column number is the one-based index of the id column in the
tab-delimited file. <br>
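<br>
The column replacement can be sketched in Python as follows. Unlike the scripts described here, this sketch does not generate new ids: it reports unmapped ids so they can be investigated. <code>map_id_column</code> is a hypothetical name, and the mapping is assumed to be a dictionary already loaded from the mapping file.<br>
<pre>
```python
def map_id_column(lines, mapping, column):
    """Replace the ids in one column of tab-delimited lines.

    `column` is the one-based index of the id column. Returns the rewritten
    lines plus a list of ids that had no entry in `mapping`.
    """
    mapped, missing = [], []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) >= column:
            old = fields[column - 1]
            if old in mapping:
                fields[column - 1] = mapping[old]
            else:
                # Left unchanged here; the caller decides what to do.
                missing.append(old)
        mapped.append("\t".join(fields))
    return mapped, missing
```
</pre>
A heading row, if present, will show up in the list of missing ids, since its column title is not a student id.<br>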
<br>
The mapping can also be applied to a file containing a list of ids using
the (older) mapid script:<br>
<br>
perl mapid.pl mapfile.txt < old >
new<br>
<br>
This script expects to operate on a file containing a list of student
ids, one to a line. It will match id patterns on the whole line, so ids
may contain spaces. <br>
This provides an alternate way to anonymize a spreadsheet column: cut
the whole column from the spreadsheet, paste it into a text file, apply
the mapid script to that text file, save the result into a new text
file, select all the text in the new text file, and paste it into
the id column in the original spreadsheet. <br>
<br>
Both these scripts report if they have to generate new ids, and update
the file mapfile.txt with new mappings generated. However, because a
missing mapping may indicate some error, it is a good idea to back up
the idmap file after an earlier stage in the process.<br>
<h2>Converting Raw Logs to DataShop XML Format</h2>
The script log2xml.pl can be used to convert an anonymized dataset into
files in the xml format the DataShop uses for import into their
database. This script reads standard input and takes no command line
arguments (but does need files, see below). With compressed raw logs,
usage is typically of the form<br>
<br>
zcat Fall2008-USNA.dat.z | perl log2xml.pl<br>
<br>
At the DataShop's request, this will create one folder for each
student's logs, and one xml log file per session log. <br>
<br>
At the end of the
conversion, this set of directories should be zipped up and delivered
to the DataShop for
dropoff. The DataShop should give you ssh access to a dropbox location
on <code>the-cooker.pslc.cs.cmu.edu</code> for this purpose.<br>
<h5>Info files</h5>
Although the basic conversion is simple, the converter makes use of
several external files to patch in information not included in the
Andes raw logs. It will look for them in a subdirectory named "Info"
within its working directory when run. <br>
<br>
Required:<br>
dataset.txt -- contains dataset name <br>
Optional:<br>
classmap.txt -- student to class id mapping. <br>
class-XXX.xml -- class XML element for class w/id XXX<br>
conditionmap.txt -- student to condition id mapping<br>
condition-XXX.xml -- condition element for id XXX<br>
unitmap.txt -- problem to unit mapping<br>
<br>
The dataset.txt file must exist. All others are optional; if one is not found,
its information will not be included in the conversion.<br>
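<br>
Before starting a long conversion it can help to check which of the fixed-name Info files are present. A small Python sketch of such a check, which is not part of log2xml.pl; <code>check_info_dir</code> is a hypothetical helper, and the per-class and per-condition XML files are not checked because their names depend on the ids in use.<br>
<pre>
```python
import os

REQUIRED = ["dataset.txt"]
OPTIONAL = ["classmap.txt", "conditionmap.txt", "unitmap.txt"]


def check_info_dir(info_dir="Info"):
    """Return (missing required files, missing optional files)."""
    if os.path.isdir(info_dir):
        present = set(os.listdir(info_dir))
    else:
        present = set()
    missing_required = [name for name in REQUIRED if name not in present]
    missing_optional = [name for name in OPTIONAL if name not in present]
    return missing_required, missing_optional
```
</pre>
A non-empty first list means the conversion should not be started yet. A non-empty second list only means that the corresponding information will be left out.<br>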
<br>
Samples of these files can be viewed in the directory Info.sample in
the LogProcessing directory. <br>
<h6>classmap.txt: student to class id mapping<br>
</h6>
The classmap.txt file is a two-column tab-separated mapping of student
ids to some string identifying the class they were in. The string
itself will never be shown to users so is completely arbitrary. It
could be
a section number like "3346" or it could just be an instructor name
like "gershman". <br>
<br>
The classmap file can be generated from roster files by using the
simple perl script rosters-to-classmap.pl as follows:<br>
<br>
perl rosters-to-classmap.pl *-roster.txt >
raw-classmap.txt<br>
<br>
This writes a two-column tab-delimited output with the penultimate
column of each line in the roster file (which should be the OLI user
id) followed by the file name with "-roster.txt" removed. <br>
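<br>
The transformation just described is simple enough to sketch in Python. This is not the Perl script: <code>roster_to_classmap</code> is a hypothetical name, and the sketch assumes the roster has already been converted to tab-delimited form.<br>
<pre>
```python
def roster_to_classmap(filename, lines):
    """Return "user id, class id" rows for one tab-delimited roster file.

    The class id is the file name with "-roster.txt" removed, and the user
    id is taken from the penultimate column of each roster line.
    """
    suffix = "-roster.txt"
    if filename.endswith(suffix):
        class_id = filename[:-len(suffix)]
    else:
        class_id = filename
    rows = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        # Lines with fewer than two columns have no penultimate column.
        if len(fields) >= 2:
            rows.append(fields[-2] + "\t" + class_id)
    return rows
```
</pre>
Calling this once per roster file and concatenating the results gives the same two-column layout as raw-classmap.txt, still unanonymized.<br>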
<br>
This output is in the correct format for a classmap file, but is
not anonymized. For mid ids, the whole file can be anonymized directly
by mapid-usna.pl, since it can recognize which parts of the text have
the form of mid ids. <br>
<br>
For non-mid-ids, the table can be anonymized by the mapidcol.pl script,
as<br>
<br>
perl mapidcol.pl idmap.txt 1 < raw-classmap.txt
> classmap.txt<br>
<br>
Or use the one-liner:
<pre>
perl -p -e 's/^.*?\t(\w+)/$1\tUSNA/' log/Spring2009-USNA-mapping.txt > Info/classmap.txt
</pre>
<h6>Class information</h6>
The Info file class-XXX.xml specifies the full xml element to patch
into the conversion output for specifying the class of students mapped
to XXX via the classmap. These contents are what datashop users
will actually see. For
example:<br>
<br>
<class><br>
<name>SP212 General Physics II</name><br>
<school>USNA</school><br>
<period>6546</period><br>
<description>Electricity and
Magnetism Spring 2007</description><br>
<instructor>Mary Wintersgill</instructor><br>
</class> <br>
<br>
If any of this information is not known, then it need not be included. <br>
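<br>
When many class files are needed, the element can be generated instead of typed by hand. A minimal Python sketch using the standard library; <code>class_element</code> is a hypothetical helper, the tag names and their order are taken from the sample above, and omitting unknown fields follows the note that they need not be included.<br>
<pre>
```python
import xml.etree.ElementTree as ET

# Tag names and order follow the sample class element shown above.
CLASS_TAGS = ("name", "school", "period", "description", "instructor")


def class_element(**fields):
    """Return a class element as a string, skipping fields not supplied."""
    root = ET.Element("class")
    for tag in CLASS_TAGS:
        if fields.get(tag):
            ET.SubElement(root, tag).text = fields[tag]
    # encoding="unicode" returns a str with no XML declaration.
    return ET.tostring(root, encoding="unicode")
```
</pre>
The returned string would be saved as class-XXX.xml in the Info directory, with XXX replaced by the class id used in classmap.txt. Special characters in the field values are escaped by the library.<br>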
<h6>Condition information</h6>
Experimental condition information may come from two sources: it may be
included in the logs, if OLI had been customized for different
conditions, as in Sandy Katz's USNA experiments. In this case the
condition will always be one of "control", "experiment", "experiment1"
or "experiment2", which are the only possible values defined in our OLI
course. In this case no conditionmap file is needed.<br>
<br>
Alternatively, the condition information may be patched in via a
conditionmap.txt file, which maps student ids to experiment condition
ids. This is more common for logs from lab studies. <br>
<br>
Again, the ids used here are arbitrary. The xml element
condition-XXX.xml will be patched into the converted log to explain the
student's experiment condition. A sample is<br>
<br>
<condition><br>
<name>Katz Reflective
Dialogue</name><br>
<type>Experimental</type><br>
</condition><br>
<br>
Note a conditionmap will take priority over conditions found in the
logs. It is not required to include any condition information.<br>
<br>
I believe the datashop format does not support multiple conditions in a
log, even though a USNA student might have participated in multiple
experiments, say a lab study and also a longer study like Sandy Katz's.
<br>
<h6>Unit information</h6>
The converter also makes use of a mapping from problem id to problem
set, e.g. kt1 -> Translational Kinematics, since this information is
not in the logs. This is contained in the file unitmap.txt. This file
can be generated from a set of aps files by the script
mkunitmap.pl as<br>
<br>
mkunitmap.pl *.aps > unitmap.txt<br>
<br>
This changes rarely so it can usually just be copied. However, as new
problems are added to Andes it might need to be regenerated. <br>
<br>
<br>
<br>
</body>
</html>