# PTB (Penn Treebank) dataset introduction

Official page: [Treebank-3](https://catalog.ldc.upenn.edu/ldc99t42)

In Chainer, the PTB dataset can be obtained with a built-in function.

In [1]:
from __future__ import print_function
import os
import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np

import chainer

## Download PTB dataset

Chainer provides the `chainer.datasets.get_ptb_words` function to get the PTB dataset.
The dataset is automatically downloaded from https://github.com/tomsercu/lstm/tree/master/data the first time the function is called; from the second call on, the cached copy is used.

In [2]:
train, val, test = chainer.datasets.get_ptb_words()

Each of the three splits is a one-dimensional `numpy.ndarray`.

`train[i]` is an integer: the word ID of the i-th word in the training text.

In [3]:
print('train type: ', type(train), train.shape, train)
print('val   type: ', type(val), val.shape, val)
print('test  type: ', type(test), test.shape, test)

train type:  <class 'numpy.ndarray'> (929589,) [ 0  1  2 ..., 39 26 24]
val   type:  <class 'numpy.ndarray'> (73760,) [2211  396 1129 ...,  108   27   24]
test  type:  <class 'numpy.ndarray'> (82430,) [142  78  54 ...,  87 214  24]
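To put the three sizes in perspective, the short sketch below computes each split's share of the total word count. It only uses the shapes printed above; the names `split_sizes`, `total_words`, and `fractions` are introduced here for illustration.

```python
# Word counts copied from the shapes printed above.
split_sizes = {'train': 929589, 'val': 73760, 'test': 82430}

total_words = sum(split_sizes.values())

# Share of each split, as a fraction of all words.
fractions = {name: n / float(total_words) for name, n in split_sizes.items()}

for name in ('train', 'val', 'test'):
    print('{:5s} {:7d} words ({:.1%})'.format(
        name, split_sizes[name], fractions[name]))
```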


Each word ID corresponds to a specific word or symbol.

The symbols include the following:
 - `<eos>` : end of sequence
 - `<unk>` : unknown word (presumably a word that was not in the 10000-word vocabulary).
 
The mapping between words and word IDs can be obtained as a dictionary with the `chainer.datasets.get_ptb_words_vocabulary()` function.

In [5]:
ptb_dict = chainer.datasets.get_ptb_words_vocabulary()
print('Number of vocabulary', len(ptb_dict))
print('ptb_dict', ptb_dict)

Number of vocabulary 10000
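Once the vocabulary dictionary is available, the special symbols can be looked up and counted in an ID array. The following is a minimal sketch that runs without downloading anything: `toy_dict` and `toy_ids` are made-up stand-ins for `ptb_dict` and `train`, and `count_symbol` is a hypothetical helper, not part of Chainer.

```python
import numpy as np

# Made-up stand-ins for ptb_dict (word -> ID) and train (array of IDs).
toy_dict = {'the': 0, 'cat': 1, '<unk>': 2, 'sat': 3, '<eos>': 4}
toy_ids = np.array([0, 1, 3, 4, 0, 2, 3, 4], dtype=np.int32)


def count_symbol(ids, word_to_id, symbol):
    """Return how many times `symbol` occurs in the ID array `ids`."""
    return int(np.sum(ids == word_to_id[symbol]))


n_eos = count_symbol(toy_ids, toy_dict, '<eos>')  # number of sequence ends
n_unk = count_symbol(toy_ids, toy_dict, '<unk>')  # number of unknown words
unk_ratio = n_unk / float(len(toy_ids))

print('<eos> count:', n_eos)
print('<unk> count:', n_unk, 'ratio:', unk_ratio)
```

Replacing `toy_dict` with `ptb_dict` and `toy_ids` with `train` should give the same statistics for the real training split.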


### Convert to word sequences

Check the original sentences by converting word IDs back to words using the PTB dictionary.
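As a sketch of this conversion, the helper below inverts a word-to-ID dictionary and splits the decoded words into sentences at `<eos>`. The function `ids_to_sentences` and the toy vocabulary are assumptions made for illustration; they are not part of Chainer.

```python
def ids_to_sentences(ids, id_to_word, eos='<eos>'):
    """Decode a sequence of word IDs into a list of sentence strings.

    A sentence ends at each `eos` symbol; the symbol itself is dropped.
    """
    sentences = []
    current = []
    for i in ids:
        word = id_to_word[int(i)]
        if word == eos:
            sentences.append(' '.join(current))
            current = []
        else:
            current.append(word)
    if current:  # trailing words without a final <eos>
        sentences.append(' '.join(current))
    return sentences


# Toy vocabulary (word -> ID), standing in for the PTB dictionary.
toy_word_id = {'pierre': 0, '<unk>': 1, 'N': 2, 'years': 3, 'old': 4,
               '<eos>': 5, 'mr.': 6, 'is': 7, 'chairman': 8}
toy_id_word = dict((v, k) for k, v in toy_word_id.items())

toy_sentences = ids_to_sentences([0, 1, 2, 3, 4, 5, 6, 1, 7, 8, 5],
                                 toy_id_word)
print(toy_sentences)
```

The same call should work on a slice of the real data, for example `train[:300]`, once the inverted PTB dictionary is passed as `id_to_word`.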

#### Train text

This is the same text as [https://raw.githubusercontent.com/tomsercu/lstm/master/data/ptb.train.txt](https://raw.githubusercontent.com/tomsercu/lstm/master/data/ptb.train.txt)

In [14]:
ptb_word_id_dict = ptb_dict
ptb_id_word_dict = dict((v,k) for k,v in ptb_word_id_dict.items())

# Same as https://raw.githubusercontent.com/tomsercu/lstm/master/data/ptb.train.txt
print([ptb_id_word_dict[i] for i in train[:30]])

['aer', 'banknote', 'berlitz', 'calloway', 'centrust', 'cluett', 'fromstein', 'gitano', 'guterman', 'hydro-quebec', 'ipo', 'kia', 'memotec', 'mlx', 'nahb', 'punts', 'rake', 'regatta', 'rubens', 'sim', 'snack-food', 'ssangyong', 'swapo', 'wachter', '<eos>', 'pierre', '<unk>', 'N', 'years', 'old']


In [15]:
# ' '.join() makes the list representation more readable

' '.join([ptb_id_word_dict[i] for i in train[:300]])

"aer banknote berlitz calloway centrust cluett fromstein gitano guterman hydro-quebec ipo kia memotec mlx nahb punts rake regatta rubens sim snack-food ssangyong swapo wachter <eos> pierre <unk> N years old will join the board as a nonexecutive director nov. N <eos> mr. <unk> is chairman of <unk> n.v. the dutch publishing group <eos> rudolph <unk> N years old and former chairman of consolidated gold fields plc was named a nonexecutive director of this british industrial conglomerate <eos> a form of asbestos once used to make kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed to it more than N years ago researchers reported <eos> the asbestos fiber <unk> is unusually <unk> once it enters the <unk> with even brief exposures to it causing symptoms that show up decades later researchers said <eos> <unk> inc. the unit of new york-based <unk> corp. that makes kent cigarettes stopped using <unk> in its <unk> cigarette filters in N <eos> altho

#### Validation data text

This is the same text as [https://raw.githubusercontent.com/tomsercu/lstm/master/data/ptb.valid.txt](https://raw.githubusercontent.com/tomsercu/lstm/master/data/ptb.valid.txt)

In [16]:
print(' '.join([ptb_id_word_dict[i] for i in val[:300]]))

consumers may want to move their telephones a little closer to the tv set <eos> <unk> <unk> watching abc 's monday night football can now vote during <unk> for the greatest play in N years from among four or five <unk> <unk> <eos> two weeks ago viewers of several nbc <unk> consumer segments started calling a N number for advice on various <unk> issues <eos> and the new syndicated reality show hard copy records viewers ' opinions for possible airing on the next day 's show <eos> interactive telephone technology has taken a new leap in <unk> and television programmers are racing to exploit the possibilities <eos> eventually viewers may grow <unk> with the technology and <unk> the cost <eos> but right now programmers are figuring that viewers who are busy dialing up a range of services may put down their <unk> control <unk> and stay <unk> <eos> we 've been spending a lot of time in los angeles talking to tv production people says mike parks president of call interactive which supplied tec

#### Test data text

This is the same text as [https://raw.githubusercontent.com/tomsercu/lstm/master/data/ptb.test.txt](https://raw.githubusercontent.com/tomsercu/lstm/master/data/ptb.test.txt)

In [17]:
print(' '.join([ptb_id_word_dict[i] for i in test[:300]]))

no it was n't black monday <eos> but while the new york stock exchange did n't fall apart friday as the dow jones industrial average plunged N points most of it in the final hour it barely managed to stay this side of chaos <eos> some circuit breakers installed after the october N crash failed their first test traders say unable to cool the selling panic in both stocks and futures <eos> the N stock specialist firms on the big board floor the buyers and sellers of last resort who were criticized after the N crash once again could n't handle the selling pressure <eos> big investment banks refused to step up to the plate to support the beleaguered floor traders by buying big blocks of stock traders say <eos> heavy selling of standard & poor 's 500-stock index futures in chicago <unk> beat stocks downward <eos> seven big board stocks ual amr bankamerica walt disney capital cities\/abc philip morris and pacific telesis group stopped trading and never resumed <eos> the <unk> has already begu