# PennTreeBank: WSJ and Brown Conversion to Dependency Parsing

We use the LTH converter from http://nlp.cs.lth.se/software/treebank_converter/, which does all of the hard work. This notebook is a convenience for anyone who wants the WSJ and Brown corpora in a dependency-parsing format.

**Why no ATIS or SWBD?** Because the tool does not support them.

This notebook first creates a dependency-parsed version of the WSJ and Brown corpora, then builds NumPy adjacency matrices from the resulting dependency trees.

## Get yourself a PennTreeBank

Make sure to get the file `penn_treebank_3.zip`. It is out there on the Internet.

## Unzip it!


In [None]:
!unzip penn_treebank_3.zip

## Get LTH (http://nlp.cs.lth.se/software/treebank_converter/)

In [None]:
!wget http://fileadmin.cs.lth.se/nlp/software/pennconverter/pennconverter.jar -q --show-progress



## Converting files

LTH works only with the .mrg files under `parsed/mrg`; the .prd and .pos files will not work. The following script writes .pd files in dependency-parsing format under `parsed/pd`, mirroring the directory structure of the original files.

In [None]:
import subprocess
from os import listdir, mkdir
from os.path import join, isdir
from tqdm.notebook import tqdm


def convert_corpus(path, corpus_name="wsj", converter_path="pennconverter.jar"):
    """
    path: path to treebank_3 folder
    corpus_name: one of "wsj" or "brown"
    converter_path: path to the pennconverter.jar file
    """
    fullpath = join(path, "parsed/mrg", corpus_name)

    # Create parsed/pd/<corpus_name> if it does not exist yet
    if not isdir(join(path, "parsed", "pd")):
        mkdir(join(path, "parsed", "pd"))
    if not isdir(join(path, "parsed", "pd", corpus_name)):
        mkdir(join(path, "parsed", "pd", corpus_name))
    for subdir in tqdm(listdir(fullpath), desc="Total Progress"):
        # Skip .LOG files and the "r" entry
        if ".LOG" not in subdir and "r" != subdir:
            output_dir = join(path, "parsed", "pd", corpus_name, subdir)
            if not isdir(output_dir):
                mkdir(output_dir)
            subdirpath = join(fullpath, subdir)
            for sample in tqdm(listdir(subdirpath), desc="Folder {}/{}".format(corpus_name, subdir)):
                command = "java -jar {}{} < {} > {}"
                input_file = join(subdirpath, sample)
                output_file = join(output_dir, sample.split(".")[0] + ".pd")
                if corpus_name == "brown":
                    parameter = " -rightBranching=false"
                else:
                    parameter = ""
                # Use the converter_path argument instead of a hard-coded jar name
                command = command.format(converter_path, parameter, input_file, output_file)
                subprocess.run(command, shell=True)  # run java command for each file
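
The loop above builds one shell string per file. As a minimal sketch of a shell-free variant, the same call can be expressed as an argument list with the input and output files passed as `stdin` and `stdout`. The helper names `build_converter_command` and `convert_file` are hypothetical, and the only converter flag assumed is the `-rightBranching=false` option already used above.

```python
import subprocess


def build_converter_command(converter_path, corpus_name):
    """Return the java argument list for one conversion (hypothetical helper)."""
    command = ["java", "-jar", converter_path]
    if corpus_name == "brown":
        # Same flag the notebook passes for the Brown corpus.
        command.append("-rightBranching=false")
    return command


def convert_file(input_file, output_file, corpus_name="wsj",
                 converter_path="pennconverter.jar"):
    """Pipe one .mrg file through the converter without going through a shell."""
    command = build_converter_command(converter_path, corpus_name)
    with open(input_file, "rb") as fin, open(output_file, "wb") as fout:
        # check=True raises CalledProcessError if java exits with an error.
        subprocess.run(command, stdin=fin, stdout=fout, check=True)


print(build_converter_command("pennconverter.jar", "brown"))
```

With `check=True`, a failed conversion stops the run instead of silently leaving an empty .pd file behind.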

### WSJ (~20 min)

In [None]:
convert_corpus("treebank_3", "wsj")

### Brown (~5 min)

In [None]:
convert_corpus("treebank_3", "brown")

### Creating Pure Text Data and Dependency Graphs in NumPy

This part creates a plain-text version of each file together with one NumPy dependency graph per sentence. They are written to .txt and .npy files, respectively, under the `parsed/pd` directory.
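
To make the matrix layout concrete, here is a minimal sketch on a made-up three-word sentence. The sentence and its head indices are illustrative, not taken from the corpus. Head indices are 1-based and 0 marks the root, which is how the converter output is read in this notebook.

```python
import numpy as np

# Toy sentence: (word, 1-based head index), with 0 marking the root.
sample = [("She", 2), ("reads", 0), ("books", 2)]

n = len(sample)
adjacency = np.zeros((n, n))
for child, (_, head) in enumerate(sample):
    if head != 0:  # the root has no parent, so its row stays empty
        adjacency[child][head - 1] = 1.0

print(adjacency)
```

Each row corresponds to a word and each column to its parent, so rows 0 and 2 have a 1 in column 1 ("reads"), while the root row is all zeros.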

In [None]:
import numpy as np


def write_sentences_to_file(s, output_file):
    # Write one sentence per line next to the .pd file, with a .txt extension
    with open(output_file.split(".")[0] + ".txt", "w") as f:
        for sentence in s:
            f.write(sentence + "\n")

def create_matrices(path, corpus_name="wsj"):
    fullpath = join(path, "parsed/mrg" , corpus_name)

    if not isdir(join(path,"parsed","pd")):
        mkdir(join(path,"parsed","pd"))
    if not isdir(join(path,"parsed","pd",corpus_name)):
        mkdir(join(path,"parsed","pd",corpus_name))
    for subdir in tqdm(listdir(fullpath), desc="Total Progress"):
        
        if ".LOG" not in subdir and "r" != subdir:
            output_dir = join(path,"parsed","pd",corpus_name,subdir)
            if not isdir(output_dir):
                mkdir(output_dir)
            subdirpath = join(fullpath,subdir)
            for sample in tqdm(listdir(subdirpath), desc="Folder {}/{}".format(corpus_name,subdir)):
            
                output_file = join(output_dir,sample.split(".")[0] + ".pd")
                res, s = create_adjacency_matrix(output_file)
                np.save(output_file.split(".")[0]+".npy", res)
                write_sentences_to_file(s, output_file)

def get_matrix(samples):
    r = []
    s = []
    for sample in samples:
        n = len(sample)
        adjacency = np.zeros((n, n))
        sentence = " ".join([x[0] for x in sample])
        s.append(sentence)
        for j, (word, parent) in enumerate(sample):
            if parent != 0:  # parent == 0 marks the root word, which has no parent
                adjacency[j][parent - 1] = 1.0  # row = word, column = its parent
        r.append(adjacency)
    return r, s

def create_adjacency_matrix(filename):
    with open(filename, "r") as f:
        data = f.readlines()
    samples = []
    sample = []
    for line in data:
        line = line.rstrip()
        if line != "":
            line = line.split("\t")
            sample.append([line[1], int(line[6])])  # Get word and parent id
        else:
            # A blank line closes the current sentence
            samples.append(sample)
            sample = []
    if sample:
        # Keep the last sentence when the file lacks a trailing blank line
        samples.append(sample)
    # Build the matrices once, after the whole file has been read
    m, s = get_matrix(samples)
    return np.array(m, dtype=object), s
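
Because sentences differ in length, the matrices in one file usually have different shapes, which is why they are stored as an object array. The following sketch, using a temporary file and two dummy matrices, shows the save/load round trip and why `allow_pickle=True` is needed when reading the .npy files back.

```python
import os
import tempfile

import numpy as np

# Two sentences of different length give matrices of different shape.
matrices = [np.zeros((2, 2)), np.zeros((3, 3))]
ragged = np.array(matrices, dtype=object)  # 1-D array holding two matrices

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "toy.npy")
    np.save(target, ragged)  # object arrays are pickled on save
    # np.load refuses object arrays unless allow_pickle=True
    loaded = np.load(target, allow_pickle=True)

shapes = [m.shape for m in loaded]
print(shapes)
```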

#### Brown (~6 min)

In [None]:
create_matrices("treebank_3", corpus_name="brown")

#### WSJ (~7 min)

In [None]:
create_matrices("treebank_3", corpus_name="wsj")

### Compress Dataset

In [None]:
!tar -czf treebank_3.tar.gz treebank_3

### Download Augmented Dataset

In [None]:
from google.colab import files
files.download('treebank_3.tar.gz') 

In [None]:
!du -h treebank_3.tar.gz

71M	treebank_3.tar.gz


## Dataset Class

We provide here a PyTorch `Dataset` class for the dependency-parsed version of the PennTreeBank. It optionally accepts a `PreTrainedTokenizer` from the `transformers` library. It is modeled on `LineByLineTextDataset` from https://github.com/huggingface/transformers/blob/master/src/transformers/data/datasets/language_modeling.py.

In [None]:
import torch 
from torch.utils.data import Dataset
import numpy as np
from os.path import join
from os import listdir
"""
PennDP: a Dataset class for the Penn TreeBank Dependency Parsed Dataset
path: path to treebank_3 folder
corpus_name: one of 'wsj' or 'brown' (optional, default 'wsj')
split: 'train', 'val' or 'test'; any other value selects all sections
tokenizer: optional tokenizer applied to each sentence (default None)
"""

class PennDP(Dataset):

    def __init__(self, path, corpus_name='wsj', split='train', tokenizer=None):
        super().__init__()
        # We look for all samples in folder
        self.sample_ids = []
        self.sample_sentences = []
        self.sample_matrices = []
        self.sample_tokens = []
        self.tokenizer = tokenizer
        fullpath = join(path, "parsed/mrg" , corpus_name)

        """
        WSJ splits defined according to:
        https://aclweb.org/aclwiki/POS_Tagging_(State_of_the_art)
        
        """
        if corpus_name == 'wsj':
            if split == 'test':
                # The split cited above uses sections 22-24 for testing
                splits = ["{:02}".format(x) for x in range(22,25)]
            elif split == 'val':
                splits = ["{:02}".format(x) for x in range(19,22)]
            elif split == 'train':
                splits = ["{:02}".format(x) for x in range(19)]
            else:
                splits = ["{:02}".format(x) for x in range(25)] # all of them
        
        if corpus_name == 'brown':
            if split == 'test':
                splits = ['cf','cg','ck','cl','cm']
            elif split == 'val':
                splits = ['cn']
            elif split == 'train':
                splits = ['cp','cr']
            else:
                splits = ['cf','cg','ck','cl','cm','cn','cp','cr']

        for subdir in listdir(fullpath):
            subdirpath = join(fullpath,subdir)
            if ".LOG" not in subdir and "r" != subdir and subdir in splits:
                for sample in listdir(subdirpath):
                    self.sample_ids.append(join(subdir,sample.split(".")[0]))
        
        datapath = join(path, "parsed/pd" , corpus_name)
        for sample_id in self.sample_ids:
            # paths to the sentences and matrices of this sample
            sentence_path = join(datapath, sample_id + ".txt")
            matrix_path = join(datapath, sample_id + ".npy")

            with open(sentence_path, "r") as sentence_file:
                lines = sentence_file.read().splitlines()
            matrix_file = np.load(matrix_path, allow_pickle=True)
            # if available, tokenize sentences
            if self.tokenizer is not None:
                examples = self.tokenizer(lines,
                                          add_special_tokens=True,
                                          truncation=True)['input_ids']
                examples = [torch.tensor(e, dtype=torch.long) for e in examples]
                self.sample_tokens.extend(examples)

            # split each sentence into its words
            self.sample_sentences.extend(line.split(" ") for line in lines)
            self.sample_matrices.extend(matrix_file)
        

    def __len__(self):
        return len(self.sample_sentences)

    def __getitem__(self, idx):
        # Matrices are stored as [word][parent]; the transpose returned here is [parent][word]
        if self.tokenizer is None:
            return self.sample_sentences[idx], self.sample_matrices[idx].transpose()
        else:
            return self.sample_sentences[idx], self.sample_matrices[idx].transpose(), self.sample_tokens[idx]
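
Since every sentence has its own length, the matrices returned by the dataset cannot be stacked into a batch directly. One possible approach, sketched below with a hypothetical `pad_matrices` helper, is to zero-pad them to the size of the longest sentence in the batch. Such a helper could be called from a custom `collate_fn` passed to `torch.utils.data.DataLoader`; that wiring is left out here.

```python
import numpy as np


def pad_matrices(matrices):
    """Zero-pad square matrices to a common size and stack them (hypothetical helper)."""
    size = max(m.shape[0] for m in matrices)
    batch = np.zeros((len(matrices), size, size))
    for i, m in enumerate(matrices):
        n = m.shape[0]
        batch[i, :n, :n] = m  # copy into the top-left corner, the rest stays zero
    return batch


batch = pad_matrices([np.ones((2, 2)), np.ones((3, 3))])
print(batch.shape)  # (2, 3, 3)
```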
      

## Usage Examples

### Without a tokenizer, for a particular WSJ or Brown split

In [None]:
data = PennDP("treebank_3", corpus_name="wsj", split="train")

In [None]:
data = PennDP("treebank_3", corpus_name="brown", split="train")

#### Output sample

In [None]:
data[0]

(['Jim',
  'Pattison',
  'Industries',
  'Ltd.',
  ',',
  'one',
  'of',
  'a',
  'group',
  'of',
  'closely',
  'held',
  'companies',
  'owned',
  'by',
  'entrepreneur',
  'James',
  'Pattison',
  ',',
  'said',
  'it',
  '``',
  'intends',
  'to',
  'seek',
  'control',
  "''",
  'of',
  '30%-owned',
  'Innopac',
  'Inc.',
  ',',
  'a',
  'Toronto',
  'packaging',
  'concern',
  '.'],
 array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [1., 1., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 1., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]))
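
In the returned matrix, rows are parents and columns are their children: row 2 ('Industries') above has ones in columns 0 and 1 ('Jim', 'Pattison'). As a minimal sketch, the parent of each word can be read back column by column. The helper name `heads_from_matrix` is hypothetical, and the toy matrix is made up for illustration.

```python
import numpy as np


def heads_from_matrix(matrix):
    """Return the parent index of each token, or -1 for the root (hypothetical helper)."""
    heads = []
    for child in range(matrix.shape[1]):
        parents = np.flatnonzero(matrix[:, child])
        heads.append(int(parents[0]) if len(parents) else -1)
    return heads


# Toy matrix in the returned layout: "reads" (index 1) governs "She" and "books".
toy = np.array([[0., 0., 0.],
                [1., 0., 1.],
                [0., 0., 0.]])
print(heads_from_matrix(toy))  # [1, -1, 1]
```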

### With a tokenizer, for a particular WSJ or Brown split

We first install the required `transformers` library:

In [None]:
!pip install transformers -q

We then load a pretrained tokenizer, here the one used by BERT:

In [None]:
from transformers import BertTokenizerFast
# Load the pretrained tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')

In [None]:
data = PennDP("treebank_3", corpus_name="wsj", split='train', tokenizer=tokenizer)

#### Sample Output

In [None]:
data[0]

(['Jim',
  'Pattison',
  'Industries',
  'Ltd.',
  ',',
  'one',
  'of',
  'a',
  'group',
  'of',
  'closely',
  'held',
  'companies',
  'owned',
  'by',
  'entrepreneur',
  'James',
  'Pattison',
  ',',
  'said',
  'it',
  '``',
  'intends',
  'to',
  'seek',
  'control',
  "''",
  'of',
  '30%-owned',
  'Innopac',
  'Inc.',
  ',',
  'a',
  'Toronto',
  'packaging',
  'concern',
  '.'],
 array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [1., 1., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 1., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]),
 tensor([  101,  3104, 25598,  2142, 10699,  4492,   119,   117,  1141,  1104,
           170,  1372,  1104,  4099,  1316,  2557,  2205,  1118, 12035,  1600,
         25598,  2142,   117,  1163,  1122,   169,   169, 20299,  1106,  5622,
          1654,   112,   112,  1104,  1476,   110,   118,  2205,  9859,  4184,
          7409,  3561,   119,   117,   170

Let's see if the tokenizer is working correctly:

In [None]:
sentence, matrix, tokens = data[0]

print("The original sentence is:")
print(" ".join(sentence))
print("The decoded sentence is:")
print(tokenizer.decode(tokens))

The original sentence is:
Jim Pattison Industries Ltd. , one of a group of closely held companies owned by entrepreneur James Pattison , said it `` intends to seek control '' of 30%-owned Innopac Inc. , a Toronto packaging concern .
The decoded sentence is:
[CLS] Jim Pattison Industries Ltd., one of a group of closely held companies owned by entrepreneur James Pattison, said it ` ` intends to seek control'' of 30 % - owned Innopac Inc., a Toronto packaging concern. [SEP]


Everything is looking good!