# Sequencer

## Choosing a suitable text corpus

In my previous project I optimized text input by mapping the most frequently used characters in English to the chords that are easiest to produce. This time I have taken a different approach.

### Word list with frequencies
At first I considered using one of the available data sets that list words with their frequencies. I received a well-made data set from which misspelled words had been removed by cross-referencing a dictionary, but I ended up using a different text for two reasons:
* The data did not contain any punctuation marks, yet these play an important role in real texts.
* I wanted to test the optimizer with more characters than those of the English alphabet; moreover, I wanted to be able to parameterize the number of characters considered.

### Text from a book
Using text from a book seemed like a good idea, because the data would have contained punctuation marks. Even with these additional characters I considered the number of distinct characters too modest, so I opted for a different data set.

### Using source code
The source code of an open source project contains a sufficiently large number of distinct characters. I downloaded, processed and used the source code of the Linux kernel for the optimization. Leading whitespace was removed in order to eliminate the characters used for indenting the code.


## Processing the source code

The source code was downloaded to the local machine and extracted. All the files were concatenated, then leading and trailing whitespace was removed. Because indentation is usually an assisted process rather than something typed by hand, counting these characters would have painted a distorted picture of the user's typing experience.
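The whitespace removal can be sketched as follows. This is a minimal illustration, not the code used for this project: the function name `strip_and_concatenate` is hypothetical, and it assumes that stripping is applied line by line and that the file contents are already available as strings.

```python
def strip_and_concatenate(sources):
    """Join several source texts into one string.

    Hypothetical helper: it assumes whitespace is stripped from both
    ends of every line, so indentation never reaches the statistics.
    """
    lines = []
    for text in sources:
        # str.strip() removes leading and trailing spaces and tabs
        lines.extend(line.strip() for line in text.splitlines())
    return '\n'.join(lines)


sample = ["int main()\n{\n\treturn 0;  \n}\n", "    x = 1;\n"]
print(strip_and_concatenate(sample))
```

Blank lines are kept as empty lines in this sketch; whether the original processing kept or dropped them is not stated.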

After concatenation, n-grams were constructed to make the evaluation of a given keyboard faster. Evaluating a keyboard by typing the whole source of the Linux kernel would have been computationally intensive, so I extracted all the sequences of a given length and counted their frequencies. These sequences are called n-grams, where n is the number of characters in each sequence. The technique yields a statistical model of the text that can be processed faster, and evaluation can be sped up further by ignoring the least frequent sequences. I only managed to construct n-grams up to length 3, because increasing the length of the sequences drastically increases the memory usage of the program.
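Counting n-grams with a sliding window can be sketched like this. The names `count_ngrams` and `prune` are hypothetical and are not part of the `dataset` package used in the cells further down; the sketch also assumes the whole text fits in memory as a single string.

```python
from collections import Counter


def count_ngrams(text, n):
    """Count every sequence of n consecutive characters in text."""
    # A window of width n slides over the text one character at a time.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def prune(counts, keep):
    """Keep only the `keep` most frequent sequences (hypothetical helper)."""
    return dict(counts.most_common(keep))


counts = count_ngrams('abcabc', 3)
print(counts['abc'], counts['bca'], counts['cab'])
print(prune(counts, 1))
```

The number of possible distinct sequences grows exponentially with n, which is in line with the memory problem described above.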

To make the case-shifting mechanic usable, uppercase letters were treated as their lowercase counterparts.
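One way to apply this case folding to already counted sequences is to merge the counts of entries that differ only in case. This is a sketch under assumptions: `fold_case` is a hypothetical helper, and the original processing may instead have lowercased the text before counting.

```python
from collections import Counter


def fold_case(counts):
    """Merge the counts of sequences that differ only in letter case."""
    folded = Counter()
    for sequence, occurrences in counts.items():
        # Non-letter characters are left unchanged by str.lower()
        folded[sequence.lower()] += occurrences
    return folded


raw = {'If': 3, 'if': 5, 'IF': 1, '( ': 2}
print(dict(fold_case(raw)))
```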


## Sequences with highest frequencies

The following section presents the 100 most frequent sequences of each length. Some unusual sequences appear more often because of the syntax of the programming language used. Letters with their occurrences as compiled by Peter Norvig are also presented. For comparison, the graph of the sequences of length 1 was recreated showing only the letters of the English alphabet.


In [1]:
%cd /home/jovyan/work

from bokeh.io import output_notebook

output_notebook()


/home/jovyan/work


In [2]:
from dataset.common import read_ngram

ngram = read_ngram('dataset/ngram.json')
norvig = read_ngram('dataset/letters_norvig.json')
limited_characters = ngram[0:100]


In [3]:
from bokeh.plotting import figure, show, ColumnDataSource

def present(title, sequences):
    source = ColumnDataSource({
        'x': range(1, len(sequences) + 1),
        'y': [item[1] for item in sequences],
        'text': [separate(item[0]) for item in sequences],
    })

    tooltips = [
        ('rank', '@x'),
        ('occurrences', '@y'),
        ('sequence', '@text'),
    ]

    f = figure(title=title, tools='pan,wheel_zoom,reset', height=600,
               sizing_mode='stretch_width', tooltips=tooltips)

    f.line('x', 'y', line_width=2, source=source)
    f.circle('x', 'y', fill_color='white', size=8, source=source)

    show(f)


def separate(characters):
    joined = '|' + '|'.join(characters) + '|'
    return repr(joined)


In [4]:
sequence1 = limited_characters.symbols(1)[:100]
present('Sequences of length 1', sequence1)

sequence2 = limited_characters.symbols(2)[:100]
present('Sequences of length 2', sequence2)

sequence3 = limited_characters.symbols(3)[:100]
present('Sequences of length 3', sequence3)

letters = norvig.symbols(1)
present('Letters and occurrences compiled by Peter Norvig', letters)

alphabet = [item[0] for item in letters]
letters_from_sequence1 = list(filter(lambda x: x[0] in alphabet, sequence1))
present('Letters from sequences of length 1', letters_from_sequence1)


In [5]:
print('Letters ordered by their count, Peter Norvig\'s compilation')
print(' '.join([item[0][0] for item in letters]))

print('Letters ordered by their count, our data set')
print(' '.join([item[0][0] for item in letters_from_sequence1]))

Letters ordered by their count, Peter Norvig's compilation
e t a o i n s r h l d c u m f p g w y b v k x j q z
Letters ordered by their count, our data set
e t i r s a n c d o l p f u m h g x b v k w y q z j
