In [1]:
from corpy import Corpy, load_corpy
import glob

### Create a list of text documents

In [2]:
books_path = './texts/'

In [3]:
book_files = glob.glob(books_path + '*.txt')

In [4]:
book_files

['./texts/sleep.txt', './texts/the_open_boat.txt']

In [5]:
books = []
for bf in book_files:
    with open(bf, 'r') as f:
        books.append(f.read())

### Simple Corpy object

In this case, we use all the default parameters:
- mode: 'word'
- lower: True
- one_document: False
- threshold: None (all of the items will be retained)
- threshold_section: 'first' (no effect, since threshold=None)
- text_sections: (1,) (only one main section)
- text_sections_level: 'item' (sections are divided at the item level; no effect, since there is only one section)
- init_books_seq: 'normal' (documents are taken in the same order as they appear in the input list)
- punct: "'.,!?«»:;()[]-\""
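
As a rough illustration of what mode='word', lower=True and punct imply, the sketch below lowercases a text and treats every punctuation mark as its own item. This is not Corpy's actual implementation; the function name `tokenize` is hypothetical, and the behavior is only inferred from the outputs shown in this notebook.

```python
def tokenize(text, punct="'.,!?«»:;()[]-\"", lower=True):
    """Split text into word-level items; each punctuation mark becomes its own item."""
    if lower:
        text = text.lower()
    # Surround every punctuation character with spaces so split() isolates it.
    for p in punct:
        text = text.replace(p, f" {p} ")
    return text.split()


items = tokenize("I'm not talking about insomnia.")
print(items)
```

The apostrophe and the final period come out as separate items, which is consistent with the "i ' m" pattern in the chunks printed below.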

In [6]:
c = Corpy(books)

---

Getting a chunk sequentially (output: code)

In [7]:
print(c.get_chunk(chunk_mode='sequential'))

[31, 32, 13, 1014, 341, 135, 174, 57, 0, 3, 8, 95, 37, 621, 66, 408, 0, 3, 190, 38, 408, 32, 0, 3, 17, 109, 39, 9, 12, 516, 16]


---

Getting a chunk sequentially (output: items)

In [8]:
c.reset_counter()
print(c.get_chunk(chunk_mode='sequential', output_mode='item'))

['this', 'is', 'my', 'seventeenth', 'straight', 'day', 'without', 'sleep', '.', 'i', "'", 'm', 'not', 'talking', 'about', 'insomnia', '.', 'i', 'know', 'what', 'insomnia', 'is', '.', 'i', 'had', 'something', 'like', 'it', 'in', 'college', '-']


---

Getting a chunk sequentially (output: string)

In [9]:
c.reset_counter()
print(c.get_chunk(chunk_mode='sequential', output_mode='string'))

this is my seventeenth straight day without sleep . i ' m not talking about insomnia . i know what insomnia is . i had something like it in college -


---

Getting a chunk in normal mode (output: string)

In [10]:
print(c.get_chunk(chunk_mode='normal', book_sel=0, chunk_sel=0, output_mode='string'))

this is my seventeenth straight day without sleep . i ' m not talking about insomnia . i know what insomnia is . i had something like it in college -


---

Getting a chunk in normal mode from the 2nd document, starting at item 100 (output: string)

In [11]:
print(c.get_chunk(chunk_mode='normal', book_sel=1, chunk_sel=100, output_mode='string'))

the boat which here rode upon the sea . these waves were most wrongfully and barbarously abrupt and tall , and each froth - top was a problem in small boat


---

The chunk contains 30+1 items because the 'last_element' option is True by default

In [12]:
print(len(c.get_chunk(chunk_mode='normal', book_sel=1, chunk_sel=100, output_mode='item')))

31


---

Getting a chunk in random mode, first (and only) section (output: string)

In [13]:
print(c.get_chunk(chunk_mode='random', output_mode='string'))

amazing . but it helped with what quickly became my nightly routine . after ten minutes of lying near him , i would get out of bed . i would go


### Corpy object with 3 sections (item level)

In [14]:
c = Corpy(books, text_sections=[60,30,10]) #text_sections_level='item' by default
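
One plausible reading of text_sections=[60,30,10] is that the items are split into three consecutive sections with relative sizes 60:30:10. The sketch below shows that reading on a plain list; `split_sections` is a hypothetical helper, and how Corpy actually places the boundaries (for example per document or over the whole corpus) is an assumption not confirmed here.

```python
def split_sections(items, weights):
    """Split a list into consecutive sections whose sizes follow the given relative weights."""
    total = sum(weights)
    bounds = [0]
    acc = 0
    for w in weights:
        acc += w
        # Cumulative boundary, rounded to the nearest item index.
        bounds.append(round(len(items) * acc / total))
    return [items[bounds[i]:bounds[i + 1]] for i in range(len(weights))]


sections = split_sections(list(range(200)), [60, 30, 10])
print([len(s) for s in sections])
```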

---

Getting a chunk sequentially from the 1st section

In [15]:
print(c.get_chunk(chunk_mode='sequential', section=0, output_mode='string'))

this is my seventeenth straight day without sleep . i ' m not talking about insomnia . i know what insomnia is . i had something like it in college -


Repeating the same call returns the next sequential chunk. It starts from the last item of the previous chunk, because last_element=True by default.

In [16]:
print(c.get_chunk(chunk_mode='sequential', section=0, output_mode='string'))

- something like it because i ' m not sure that what i had then was exactly the same as what people refer to as insomnia . i suppose a doctor
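
The overlap between consecutive chunks can be sketched as follows: each chunk holds length+1 items, but the read position advances by only length items, so the last item of one chunk is the first item of the next. The generator `sequential_chunks` and its `length` parameter are hypothetical names, and the chunk length of 30 is inferred from the outputs above; this is not Corpy's code.

```python
def sequential_chunks(items, length=30, last_element=True):
    """Yield consecutive chunks; with last_element=True each chunk carries one extra, shared item."""
    size = length + 1 if last_element else length
    start = 0
    while start + size <= len(items):
        yield items[start:start + size]
        start += length  # advance by `length`, so the extra item is reused


chunks = list(sequential_chunks(list(range(100)), length=30))
print(len(chunks[0]), chunks[0][-1], chunks[1][0])
```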


---

Getting a chunk sequentially from the 2nd section

In [17]:
print(c.get_chunk(chunk_mode='sequential', section=1, output_mode='string'))

of a scene in the UNK of UNK of seven turned faces , and later a UNK of a top - UNK with a white ball on it that UNK to


UNKs appear because the dictionary is built on the first section by default (threshold_section='first'), so items that occur only in later sections are not in the dictionary
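
A minimal sketch of this effect, assuming a dictionary that simply collects the distinct items of the first section: anything in a later section that was never seen there is replaced by UNK. The helpers `build_vocab` and `mask_unknown` and the toy sentences are illustrative only, not part of Corpy.

```python
def build_vocab(items):
    """Assign an integer code to each distinct item, in order of first appearance."""
    vocab = {}
    for it in items:
        if it not in vocab:
            vocab[it] = len(vocab)
    return vocab


def mask_unknown(items, vocab, unk="UNK"):
    """Replace every item missing from the vocabulary with the UNK marker."""
    return [it if it in vocab else unk for it in items]


first_section = "the boat rode upon the sea".split()
second_section = "the grays of dawn upon the sea".split()

vocab = build_vocab(first_section)  # dictionary built on the first section only
masked = mask_unknown(second_section, vocab)
print(" ".join(masked))
```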

### Corpy object with 3 sections (item level) and dictionary built on all the sections

In [18]:
c = Corpy(books, text_sections=[60,30,10], threshold_section='all') #text_sections_level='item' by default

---

Getting a chunk sequentially from the 2nd section

In [19]:
print(c.get_chunk(chunk_mode='sequential', section=1, output_mode='string'))

of a scene in the grays of dawn of seven turned faces , and later a stump of a top - mast with a white ball on it that slashed to


UNKs no longer appear because the dictionary is built on all of the sections

### Corpy object with 3 sections (item level), dictionary built on all the sections and threshold=1000 (only the 1000 most frequent items are kept)

In [20]:
c = Corpy(books, text_sections=[60,30,10], threshold=1000, threshold_section='all') #text_sections_level='item' by default

---

Getting a chunk sequentially from the 2nd section

In [21]:
print(c.get_chunk(chunk_mode='sequential', section=1, output_mode='string'))

of a scene in the UNK of UNK of seven turned faces , and later a UNK of a top - UNK with a white UNK on it that UNK to


UNKs appear again because the dictionary has been built with only the 1000 most frequent items
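
The thresholding idea can be sketched with a frequency count: keep the N most frequent items, give them codes 0 to N-1, and map everything else to a single UNK code. Using N itself as the UNK code matches the note further down that code=1000 is UNK when threshold=1000, but the helper names and the exact code assignment are assumptions rather than Corpy's implementation.

```python
from collections import Counter


def build_thresholded_vocab(items, threshold):
    """Map the `threshold` most frequent items to codes 0..threshold-1."""
    counts = Counter(items)
    return {item: code for code, (item, _) in enumerate(counts.most_common(threshold))}


def encode(items, vocab, unk_code):
    """Encode items as integer codes; items outside the vocabulary get unk_code."""
    return [vocab.get(it, unk_code) for it in items]


items = "a b a c a b d".split()
vocab = build_thresholded_vocab(items, threshold=2)
codes = encode(items, vocab, unk_code=2)  # UNK code equals the threshold (assumed)
print(vocab, codes)
```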

### For machine learning purposes

By keeping last_element=True, the input and the target for an ML system can be built as follows:

In [22]:
c.reset_counter()
code = c.get_chunk(chunk_mode='sequential', section=1, output_mode='code')
print(code)

[7, 5, 491, 12, 1, 1000, 7, 1000, 7, 380, 221, 608, 2, 4, 403, 5, 1000, 7, 5, 284, 16, 1000, 24, 5, 189, 1000, 23, 9, 14, 1000, 6]


The code length is 30+1:

In [23]:
print(len(code))

31


The input is then given by the first 30 items and the target by the last 30 items, so each target item is the one that follows the corresponding input item:

In [24]:
inp = code[:-1]
target = code[1:]

In [25]:
print('Input:', inp)
print('Target:', target)

Input: [7, 5, 491, 12, 1, 1000, 7, 1000, 7, 380, 221, 608, 2, 4, 403, 5, 1000, 7, 5, 284, 16, 1000, 24, 5, 189, 1000, 23, 9, 14, 1000]
Target: [5, 491, 12, 1, 1000, 7, 1000, 7, 380, 221, 608, 2, 4, 403, 5, 1000, 7, 5, 284, 16, 1000, 24, 5, 189, 1000, 23, 9, 14, 1000, 6]


Note: code=1000 corresponds to the UNK item

The methods code2items and code2string convert the sequences back into readable form:

In [26]:
print(c.code2items(inp))

['of', 'a', 'scene', 'in', 'the', None, 'of', None, 'of', 'seven', 'turned', 'faces', ',', 'and', 'later', 'a', None, 'of', 'a', 'top', '-', None, 'with', 'a', 'white', None, 'on', 'it', 'that', None]


In [27]:
print(c.code2string(inp))

of a scene in the UNK of UNK of seven turned faces , and later a UNK of a top - UNK with a white UNK on it that UNK
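
The two outputs above differ in how they show the UNK code: the item list contains None, while the string contains UNK. A small sketch that reproduces this behavior with a plain inverse dictionary is given below. The code-to-item pairs are read off the outputs in this notebook; the functions `codes_to_items` and `codes_to_string` are hypothetical stand-ins, not the library's methods.

```python
def codes_to_items(codes, inverse_vocab):
    """Look up each code; codes without an entry (such as UNK) become None."""
    return [inverse_vocab.get(c) for c in codes]


def codes_to_string(codes, inverse_vocab, unk="UNK"):
    """Join the decoded items with spaces, writing UNK for codes without an entry."""
    return " ".join(inverse_vocab.get(c, unk) for c in codes)


# Code-to-item pairs taken from the printed input sequence above.
inverse_vocab = {1: "the", 5: "a", 7: "of", 12: "in", 491: "scene"}
codes = [7, 5, 491, 12, 1, 1000]

decoded_items = codes_to_items(codes, inverse_vocab)
decoded_string = codes_to_string(codes, inverse_vocab)
print(decoded_items)
print(decoded_string)
```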
