In [1]:
import pandas as pd
import numpy as np
import processing

In [2]:
log = pd.read_csv('./data/log_valid_processed.csv')

# Known issues

#### ITE preceded by backspaces

Sometimes a multicharacter keystroke is preceded by a run of backspaces. This raises two questions:
1. Does the user explicitly press the backspaces, or are they registered automatically when an ITE is used? (Probably the latter.)
2. If an ITE causes them, which one is responsible: autocorrect or word suggestion?

In [3]:
log.loc[log.ts_id == 15226].iloc[35:53]

Unnamed: 0,ts_id,entry_id,key,lev_dist,text_field,participant_id,len_diff,iki,text_field_prev,is_rep,ite,is_forward,iki_norm,tmp,ite2
41609,15226,5,I,1,The best-of-seaven series moves to I,4568,1,472.0,The best-of-seaven series moves to,False,none,True,472.0,False,
41610,15226,5,n,1,The best-of-seaven series moves to In,4568,1,180.0,The best-of-seaven series moves to I,False,none,True,180.0,False,
41611,15226,5,d,1,The best-of-seaven series moves to Ind,4568,1,259.0,The best-of-seaven series moves to In,False,none,True,259.0,False,
41612,15226,5,i,1,The best-of-seaven series moves to Indi,4568,1,250.0,The best-of-seaven series moves to Ind,False,none,True,250.0,False,
41613,15226,5,a,1,The best-of-seaven series moves to India,4568,1,275.0,The best-of-seaven series moves to Indi,False,none,True,275.0,False,
41614,15226,5,n,1,The best-of-seaven series moves to Indian,4568,1,127.0,The best-of-seaven series moves to India,False,none,True,127.0,False,
41615,15226,5,a,1,The best-of-seaven series moves to Indiana,4568,1,403.0,The best-of-seaven series moves to Indian,False,none,True,403.0,False,
41616,15226,5,p,1,The best-of-seaven series moves to Indianap,4568,1,360.0,The best-of-seaven series moves to Indiana,False,none,True,360.0,False,
41617,15226,5,o,1,The best-of-seaven series moves to Indianapo,4568,1,340.0,The best-of-seaven series moves to Indianap,False,none,True,340.0,False,
41618,15226,5,_,1,The best-of-seaven series moves to Indianap,4568,-1,1071.0,The best-of-seaven series moves to Indianapo,False,none,True,1071.0,False,
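
One way to start answering these questions is to count how many backspaces immediately precede each multicharacter key. The sketch below is only an illustration: it assumes that a backspace is logged as the key `_` (as in the last row above), and the key sequence in the example is invented, not taken from the log.

```python
def backspaces_before_multichar(keys, backspace="_"):
    """For each multicharacter key, count the consecutive backspaces right before it.

    Returns a list of (position, key, n_backspaces) tuples.
    Assumption: backspaces are logged with the single-character key `backspace`.
    """
    results = []
    run = 0  # length of the current run of backspaces
    for i, key in enumerate(keys):
        if key == backspace:
            run += 1
            continue
        # Missing keys (NaN in pandas) are not strings and are skipped here.
        if isinstance(key, str) and len(key) > 1:
            results.append((i, key, run))
        run = 0  # any non-backspace key ends the run
    return results


# Invented toy sequence, for illustration only.
toy_keys = ["I", "n", "d", "_", "_", "_", "Indianapolis", " ", "t", "the"]
print(backspaces_before_multichar(toy_keys))
```

Applied per `ts_id` to the `key` column, this would give a distribution of run lengths for each value of `ite`. Comparing the `iki` of those backspaces with the `iki` of ordinary backspaces might also help separate user-initiated from automatic ones, on the untested assumption that automatic backspaces arrive much faster.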


#### Two different predict backend behaviours

There are two predict backend behaviours. The first registers the entire word as the key (e.g. 'guilty'); the second registers only the completed portion (e.g. 'lty'). We analyze assuming the first behaviour, but we should correct the second to be consistent. Otherwise, many metrics are incorrect (e.g. the word length is currently measured by just taking the length of the key).

In [4]:
log.loc[log.ts_id == 48864].tail(5)

Unnamed: 0,ts_id,entry_id,key,lev_dist,text_field,participant_id,len_diff,iki,text_field_prev,is_rep,ite,is_forward,iki_norm,tmp,ite2
135928,48864,4,u,1,I hope this answer you,14665,1,176.0,I hope this answer yo,False,none,True,176.0,False,
135929,48864,4,r,1,I hope this answer your,14665,1,376.0,I hope this answer you,False,none,True,376.0,False,
135930,48864,-1,,1,I hope this answer your,14665,1,300.0,I hope this answer your,False,none,True,300.0,False,
135931,48864,5,q,1,I hope this answer your q,14665,1,2871.0,I hope this answer your,False,none,True,2871.0,False,
135932,48864,5,uestion,8,I hope this answer your question,14665,8,743.0,I hope this answer your q,False,predict,True,92.875,False,completion
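
A possible correction for the second behaviour is to rebuild the full word from `text_field` whenever the key is only a proper suffix of the last word. This is a minimal sketch, not the project's implementation; the helper name `full_word_key` is hypothetical, and it assumes words are separated by single spaces.

```python
def full_word_key(key, text_field):
    """Return the whole last word of text_field if key is only a proper suffix of it.

    Otherwise the key is returned unchanged (first backend behaviour).
    """
    stripped_key = key.strip()
    # Last word of the text, ignoring a trailing space some keyboards append.
    last_word = text_field.rstrip().rsplit(" ", 1)[-1]
    if (
        stripped_key
        and last_word.endswith(stripped_key)
        and len(last_word) > len(stripped_key)
    ):
        return last_word
    return key


# Values taken from the last row of the table above.
print(full_word_key("uestion", "I hope this answer your question"))
```

This would presumably be applied only to rows where `ite == 'predict'`, so that ordinary single-letter keystrokes are left untouched.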


#### Multi-word inputs

Some participants exhibit a keyboard behaviour where the key contains both the current word and the previously entered words. We are usually able to remove this in the filtering process, but not always. Out of 9,000 post-filtered participants, only 1 appears to still exhibit this issue.

In [5]:
log.loc[log.ts_id == 1507137].head(8)

Unnamed: 0,ts_id,entry_id,key,lev_dist,text_field,participant_id,len_diff,iki,text_field_prev,is_rep,ite,is_forward,iki_norm,tmp,ite2
5026801,1507137,0,Moreover,41,Moreover,237433,-1,,how can we get these answers.,False,none,True,,False,
5026802,1507137,0,Moreover,0,Moreover,237433,0,115.0,Moreover,False,none,True,14.375,False,
5026803,1507137,1,Moreover we,3,Moreover we,237433,3,64.0,Moreover,False,none,True,5.818182,False,
5026804,1507137,1,Moreover we,0,Moreover we,237433,0,143.0,Moreover we,False,none,True,13.0,False,
5026805,1507137,2,Moreover we have,5,Moreover we have,237433,5,180.0,Moreover we,False,none,True,11.25,False,
5026806,1507137,3,Moreover we have the,4,Moreover we have the,237433,4,205.0,Moreover we have,False,none,True,10.25,False,
5026807,1507137,4,moreover we have the,1,moreover we have the,237433,1,47.0,Moreover we have the,False,none,True,2.238095,False,
5026808,1507137,4,moreover we have the,0,moreover we have the,237433,0,69.0,moreover we have the,False,none,True,3.285714,False,
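
To look for remaining cases, one could measure how often a participant's keys contain more than one word. The sketch below is an assumption-laden heuristic rather than the filter actually used: it treats any key with two or more whitespace-separated tokens as multi-word, which could also catch legitimate multi-word suggestions.

```python
def is_multiword_key(key):
    """True if the key holds two or more whitespace-separated words."""
    return isinstance(key, str) and len(key.split()) > 1


def multiword_share(keys):
    """Fraction of keys that contain more than one word."""
    keys = list(keys)
    if not keys:
        return 0.0
    return sum(is_multiword_key(k) for k in keys) / len(keys)


# Keys from the eight rows shown above.
keys = [
    "Moreover", "Moreover", "Moreover we", "Moreover we",
    "Moreover we have", "Moreover we have the",
    "moreover we have the", "moreover we have the",
]
print(multiword_share(keys))  # 0.75
```

A participant with a high share would be a candidate for manual inspection; any cut-off value would need to be chosen against the data.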


#### Consistent multicharacter inputs

Some participants have multicharacter inputs at the end of every word (or of many words). These inputs have a Levenshtein distance (LD) of zero. This might be a side effect of having autocorrect turned on: every time the user presses SPACE, the entire word is input again, even if it was spelled correctly. This makes it hard to detect 0-LD predictions. We try to filter these out, and it seems to work quite well.
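
As a rough illustration of how such participants could be spotted (this is not the filter referred to above), the sketch below computes the share of a participant's multicharacter inputs that have zero LD. The function name and the example rows are hypothetical.

```python
def zero_ld_multichar_share(rows):
    """Share of multicharacter inputs with lev_dist == 0.

    rows: iterable of (key, lev_dist) pairs for one participant.
    """
    multichar = [
        (key, ld) for key, ld in rows if isinstance(key, str) and len(key) > 1
    ]
    if not multichar:
        return 0.0
    zero_ld = sum(1 for _, ld in multichar if ld == 0)
    return zero_ld / len(multichar)


# Invented rows: two re-inputs of correctly spelled words and one real prediction.
rows = [("h", 1), ("e", 1), ("he", 0), (" ", 1),
        ("i", 1), ("s", 1), ("is", 0), ("question", 7)]
print(zero_ld_multichar_share(rows))

# Assuming the column names shown in the tables, the same row mask in pandas
# would be: (log.key.str.len() > 1) & (log.lev_dist == 0)
```

A share close to 1 over many words would point to the autocorrect side effect, while genuine 0-LD predictions should be comparatively rare.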

#### Overly broad completion condition

Because the only condition for a completion is lev_dist == len_diff, inserting a letter into the middle of a word is also counted as a completion. For example, in the contraction case I-l-l becomes I'll, or g-u-l-t-y becomes guilty. The former can be argued to be a completion, but the latter is definitely not.
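
A stricter condition could additionally require that the previous text is a prefix of the new text, i.e. that characters were only appended. The sketch below shows that check under the assumption that `text_field_prev` and `text_field` are compared directly; note that it would also reject the I'll case, which may or may not be desired.

```python
def is_pure_completion(text_prev, text_new):
    """True only if text_new extends text_prev by appending characters at the end."""
    return len(text_new) > len(text_prev) and text_new.startswith(text_prev)


print(is_pure_completion("I hope this answer your q",
                         "I hope this answer your question"))  # True
print(is_pure_completion("gulty", "guilty"))                   # False
print(is_pure_completion("Ill", "I'll"))                       # False
```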

#### Overly strict completion vs. correction classification

If somebody types 'd-e-f-o' and then completes to 'definitely', that's a completion. But currently we mark it as a correction because of the substitution between 'defo' and 'defi'.
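
One conceivable relaxation is to tolerate a small number of mismatching characters at the end of the typed word, as long as the suggested word is longer and starts with the same letters. The sketch below is a hypothetical heuristic, not the current classifier; the function names and the `max_tail_mismatch` parameter are made up for illustration.

```python
def common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def classify_ite(word_prev, word_new, max_tail_mismatch=1):
    """Label a word replacement as 'completion' or 'correction'.

    A replacement counts as a completion if the new word is longer, shares a
    non-empty prefix with the typed word, and at most `max_tail_mismatch`
    trailing characters of the typed word fall outside that prefix.
    """
    shared = common_prefix_len(word_prev, word_new)
    tail_mismatch = len(word_prev) - shared
    if (
        len(word_new) > len(word_prev)
        and shared > 0
        and tail_mismatch <= max_tail_mismatch
    ):
        return "completion"
    return "correction"


print(classify_ite("defo", "definitely"))  # completion
print(classify_ite("gulty", "guilty"))     # correction
```

The tolerance trades precision for recall: with `max_tail_mismatch=0` the rule reduces to a strict prefix check, and larger values start to absorb real corrections.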

#### Incorrect Levenshtein distance

Sometimes the Levenshtein distance appears to be incorrect. So far, this seems to happen in two cases:
1. A backspace is inserted. The edit distance should be 1, but the Levenshtein distance is calculated as zero.
2. The edit distance is completely wrong and is way too high.

In [6]:
log.loc[log.ts_id == 68218].iloc[27:32]

Unnamed: 0,ts_id,entry_id,key,lev_dist,text_field,participant_id,len_diff,iki,text_field_prev,is_rep,ite,is_forward,iki_norm,tmp,ite2
233676,68218,2,s,1,Liverpool always has,19771,1,154.0,Liverpool always ha,False,none,True,154.0,False,
233677,68218,-1,,2,Liverpool always has,19771,1,82.0,Liverpool always has,False,none,True,82.0,False,
233678,68218,2,_,0,Liverpool always has,19771,-1,654.0,Liverpool always has,False,none,True,654.0,False,
233679,68218,2,_,1,Liverpool always ha,19771,-1,133.0,Liverpool always has,False,none,True,133.0,False,
233680,68218,2,d,1,Liverpool always had,19771,1,203.0,Liverpool always ha,False,none,True,203.0,False,


In [7]:
log.loc[log.ts_id == 773949].head(5)

Unnamed: 0,ts_id,entry_id,key,lev_dist,text_field,participant_id,len_diff,iki,text_field_prev,is_rep,ite,is_forward,iki_norm,tmp,ite2
2693245,773949,0,h,1,h,138556,-1,,he is expected to miss six weeks of action due...,False,none,True,,False,
2693246,773949,0,i,1,hi,138556,1,153.0,h,False,none,True,153.0,False,
2693247,773949,0,s,33,his,138556,1,161.0,hi,False,none,True,161.0,False,
2693248,773949,-1,,1,his,138556,1,92.0,his,False,none,True,92.0,False,
2693249,773949,1,m,1,his m,138556,1,535.0,his,False,none,True,535.0,False,
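
Rows like these could be found systematically by recomputing the distance between `text_field_prev` and `text_field` and comparing it with the recorded `lev_dist`. The sketch below uses a plain dynamic-programming Levenshtein implementation so that it has no dependencies; it is not necessarily the routine that produced the logged values, and it can only catch errors as far as the two text columns preserve the actual text (including trailing whitespace).

```python
def levenshtein(a, b):
    """Levenshtein distance between strings a and b (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def lev_dist_mismatch(text_prev, text_new, recorded):
    """True if the recorded distance differs from the recomputed one."""
    return levenshtein(text_prev, text_new) != recorded


# Third row of the table above: 'hi' -> 'his' is recorded with lev_dist 33.
print(levenshtein("hi", "his"))            # 1
print(lev_dist_mismatch("hi", "his", 33))  # True
```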
