## Wordpiece Model

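Inspired by https://arxiv.org/pdf/1609.08144.pdf (Google's Neural Machine Translation system)
and https://arxiv.org/abs/1508.07909 (Neural Machine Translation of Rare Words with Subword Units).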



In [1]:
import os
import csv

from wordpieces import wordpieces as wp
from wordpieces.wordpieces import WPDict, WPDictBuilder

data_dir    = "c-files"
input_file_name = "linux_kernel_concat.txt"
#input_file_name = "memmgr.c"
input_file  = os.path.join(data_dir, input_file_name)

In [2]:
def readCommentsFileTsv(input_file):
    """Read a tab-separated file and return its rows as a list of lists."""
    with open(input_file, encoding='utf-8', newline='') as infile:
        tsvreader = csv.reader(infile, delimiter="\t")
        return list(tsvreader)

### Build Dictionary

In [3]:
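comments = readCommentsFileTsv(input_file + ".comments.csv")

wp_dict_builder = WPDictBuilder()
for comment in comments:
    # The third column of each row is used as the comment text.
    txt = comment[2]
    wp_dict_builder.learn_sentense(txt)

# Number of distinct entries collected by the builder
print(len(wp_dict_builder.stats))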

7417


### Keep only the top 5,000 wordpieces

In [4]:
wp_dict = wp_dict_builder.build(5000)
print("dictionary size is %d "%len(wp_dict.stats))

building WPDict of size 5000
dictionary size is 5000 
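
To make the pruning step concrete, here is a minimal sketch of counting candidate pieces and keeping the most frequent ones. It is not the `wordpieces` library's code: `count_pieces`, `keep_top`, the maximum piece length of 3, and the `_` word-start marker are assumptions, loosely inferred from the output shown in this notebook.

```python
from collections import Counter

def count_pieces(sentences, max_len=3):
    """Count every substring of length 1..max_len of each '_'-prefixed word.

    Hypothetical helper: the real WPDictBuilder may collect statistics differently.
    """
    stats = Counter()
    for sentence in sentences:
        for word in sentence.split():
            marked = "_" + word  # '_' marks the start of a word (assumption)
            for start in range(len(marked)):
                longest_end = min(start + max_len, len(marked))
                for end in range(start + 1, longest_end + 1):
                    stats[marked[start:end]] += 1
    return stats

def keep_top(stats, size):
    """Keep only the `size` most frequent pieces."""
    return dict(stats.most_common(size))

stats = count_pieces(["the cat", "the hat"])
vocab = keep_top(stats, 5)
print(len(stats), len(vocab))
```

A smaller dictionary gives shorter vocabularies for a downstream model, at the cost of splitting rare words into more pieces.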


## Test splitting sentences into wordpieces from the dictionary

In [5]:
print(wp_dict.find_longest_chunk("_hema"))

# break_word: split single words into wordpieces
print(" ".join(wp_dict.break_word("wordpieces")))
print(" ".join(wp_dict.break_word("artem")))
print(" ".join(wp_dict.break_word("sentenses")))

# break_sentence: split whole sentences into wordpieces
print(" ".join(wp_dict.break_sentence("test splitting sentenses into wordpieces from dictionary")))
print(" ".join(wp_dict.break_sentence(comments[0][2])))
print(" ".join(wp_dict.break_sentence(comments[1][2])))
print("] [".join(wp_dict.break_sentence(comments[2][2])))



('_he', 3)
_wo rd pie ces
_ar tem
_se nte nse s
_te st _sp lit tin g _se nte nse s _in to _wo rd pie ces _fr om _di cti ona ry
_li nux /ke rne l/ acc t.c _BS D _Pr oce ss _Ac cou nti ng _fo r _Li nux _Au tho r: _Ma rco _va n _Wi eri nge n _<m <UNK> <UNK> <UNK> pla net s. el m. net > _So me _co de _ba sed _on _id eas _an d _co de _fr om : _Th oma s _K . _Dy as _<t dy as <UNK> ede n. ru tg ers .e du > _Th is _fi le _im ple men ts _BS D-s tyl e _pr oce ss _ac cou nti ng. _Wh ene ver _an y _pr oce ss _ex its , _an _ac cou nti ng _re cor d _of _ty pe _"s tru ct _ac ct " _is _wr itt en _to _th e _fi le _sp eci fie d _wi th _th e _ac ct( ) _sy ste m _ca ll. _It _is _up _to _us er- lev el _pr ogr ams _to _do _us efu l _th ing s _wi th _th e _ac cou nti ng _lo g. _Th e _ke rne l _ju st _pr ovi des _th e _ra w _ac cou nti ng _in for mat ion . _(C ) _Co pyr igh t _19 95 _- _19 97 _Ma rco _va n _Wi eri nge n _- _EL M _Co nsu lta ncy _B <UNK> V. _Pl ugg ed _tw o _le aks . _1) _It _di dn' t _re tur 
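
The output above is consistent with a greedy longest-match split: at each position, take the longest dictionary entry that prefixes the remaining text, and emit `<UNK>` when nothing matches. The sketch below illustrates that idea on a toy vocabulary. The functions are hypothetical stand-ins, not the library's `find_longest_chunk` and `break_word`, and the maximum piece length of 3 is an assumption.

```python
UNK = "<UNK>"

def find_longest_chunk(vocab, text, max_len=3):
    """Return (piece, length) for the longest vocabulary entry that prefixes text."""
    for length in range(min(max_len, len(text)), 0, -1):
        if text[:length] in vocab:
            return text[:length], length
    # No entry matches: consume one character and report it as unknown.
    return UNK, 1

def break_word(vocab, word, max_len=3):
    """Greedily split a word into pieces; '_' marks the word start (assumption)."""
    marked = "_" + word
    pieces = []
    pos = 0
    while pos < len(marked):
        piece, length = find_longest_chunk(vocab, marked[pos:], max_len)
        pieces.append(piece)
        pos += length
    return pieces

toy_vocab = {"_wo", "rd", "pie", "ces", "_he"}
print(find_longest_chunk(toy_vocab, "_hema"))
print(" ".join(break_word(toy_vocab, "wordpieces")))
```

Greedy matching is simple and fast, but it does not guarantee the fewest pieces: an early long match can force short or unknown pieces later in the word.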

#### Test restoring sentences from a list of wordpieces

In [6]:
sentence=comments[0][2]
print("original:---------------------")
print(sentence)

breaks = list(wp_dict.break_sentence(sentence))
print()
print("breaks:---------------------")
print(" ".join(breaks))

print(len(breaks))
print()
print("restored:---------------------")
print(wp_dict.joinSentence(breaks))

original:---------------------
linux/kernel/acct.c

BSD Process Accounting for Linux

Author: Marco van Wieringen <mvw@planets.elm.net>

Some code based on ideas and code from:
Thomas K. Dyas <tdyas@eden.rutgers.edu>

This file implements BSD-style process accounting. Whenever any
process exits, an accounting record of type "struct acct" is
written to the file specified with the acct() system call. It is
up to user-level programs to do useful things with the accounting
log. The kernel just provides the raw accounting information.

(C) Copyright 1995 - 1997 Marco van Wieringen - ELM Consultancy B.V.

Plugged two leaks. 1) It didn't return acct_file into the free_filps if
the file happened to be read-only. 2) If the accounting was suspended
due to the lack of space it happily allowed to reopen it and completely
lost the old acct_file. 3/10/98, Al Viro.

Now we silently close acct_file on attempt to reopen. Cleaned sys_acct().
XTerms and EMACS are manifestations of pure evil. 21/10/98, AV