
Sequence of Sequence encoding for TextClassification? #2839

Closed
faizan30 opened this issue May 14, 2019 · 14 comments

@faizan30 commented May 14, 2019

System

  • OS: Ubuntu
  • Python version: [e.g. 3.7]
  • AllenNLP version: [e.g. v0.8.3]
  • Question:
    I am working on a page classification task. In my dataset reader, each page is an Instance and the tokens field is a sequence of TextFields, i.e. ListField([TextField, TextField, TextField, ...]). Is there an existing way to do sequence-of-sequence encoding with AllenNLP?

@mojesty (Contributor) commented May 14, 2019

What you can do is something like

from allennlp.data import Instance, Token
from allennlp.data.fields import ListField, TextField, LabelField

tokenized_lines: List[List[Token]]
# token_indexers is whatever Dict[str, TokenIndexer] your dataset reader uses
lines_field = ListField([TextField(tokenized_line, token_indexers) for tokenized_line in tokenized_lines])
target = ListField([LabelField(label) for label in labels])  # or simply a single LabelField
instance = Instance({'lines': lines_field, 'labels': target})

ListField will handle padding automatically for you, but inner TextFields will not, so you have to pad them manually.

@matt-gardner (Member) commented May 14, 2019

@mojesty, inner text fields should also be getting padded; do you have an example where this isn't working?

@faizan30 (Author) commented May 15, 2019

@matt-gardner I think the text fields are getting padded. I'm having trouble writing an encoder for the TextFields. In the text_field_embedder, I want to use GloVe embeddings for each token in a TextField, plus an encoder that encodes all of a sentence's tokens into a 100-dimensional vector, so that each sentence is represented by a 100-dimensional vector. I'm having trouble writing this encoder.
My config file is as follows:
{
  "dataset_reader": {
    "type": "paragraph_reader",
    "delimiter": "\t",
    "page_id_index": 0,
    "doc_id_index": 1,
    "label_id_index": 3,
    "doc_download_folder": ".data/paragraph_classificattion/",
    "tokenizer": {
      "word_splitter": {
        "language": "en"
      }
    },
    "token_indexers": {
      "tokens": {
        "type": "single_id",
        "lowercase_tokens": true
      }
    }
  },
  "train_data_path": ".data/paragraph_classification/train.tsv",
  "validation_data_path": "None",
  "model": {
    "type": "paragraph_classifier",
    "text_field_embedder": {
      "tokens": {
        "type": "sequence_encoding",
        "embedding": {
          "embedding_dim": 100
        },
        "encoder": {
          "type": "gru",
          "input_size": 100,
          "hidden_size": 50,
          "num_layers": 2,
          "dropout": 0.25,
          "bidirectional": true
        }
      }
    },
    "encoder": {
      "type": "gru",
      "input_size": 200,
      "hidden_size": 100,
      "num_layers": 2,
      "dropout": 0.5,
      "bidirectional": true
    },
    "regularizer": [
      [
        "transitions$",
        {
          "type": "l2",
          "alpha": 0.01
        }
      ]
    ]
  },
  "iterator": {
    "type": "basic",
    "batch_size": 32
  },
  "trainer": {
    "optimizer": {
      "type": "adam"
    },
    "num_epochs": 3,
    "patience": 10,
    "cuda_device": -1
  }
}

@faizan30 (Author) commented May 15, 2019

@matt-gardner I think a similar approach is used by character_encoding, but for a sequence of characters: the TokenCharactersEncoder uses the TimeDistributed class to reshape the dimensions. I am hoping there is an existing encoder for my case, or do I need to change my Instance as mentioned by @mojesty?

Currently, the sequence_encoding in my config file is a copy of character_encoding (clearly this doesn't work).
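
(For reference, a custom TokenEmbedder analogous to TokenCharactersEncoder would have to fold the extra sentence dimension into the batch itself, roughly like the sketch below. The class name SentenceEncodingEmbedder is made up, and the "sequence_encoding" key simply mirrors the one in the config above; nothing like this ships with AllenNLP.)

import torch
from allennlp.modules import TimeDistributed
from allennlp.modules.seq2vec_encoders import Seq2VecEncoder
from allennlp.modules.token_embedders import Embedding, TokenEmbedder

@TokenEmbedder.register("sequence_encoding")
class SentenceEncodingEmbedder(TokenEmbedder):
    """Hypothetical analogue of TokenCharactersEncoder, one level up: it embeds
    the tokens of every sentence and runs a Seq2VecEncoder over each sentence."""
    def __init__(self, embedding: Embedding, encoder: Seq2VecEncoder) -> None:
        super().__init__()
        # TimeDistributed folds the extra (sentence) dimension into the batch,
        # exactly as TokenCharactersEncoder does for the character dimension.
        self._embedding = TimeDistributed(embedding)
        self._encoder = TimeDistributed(encoder)

    def get_output_dim(self) -> int:
        return self._encoder._module.get_output_dim()

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, num_sentences, num_tokens)
        mask = (token_ids != 0).long()
        embedded = self._embedding(token_ids)  # (batch, num_sentences, num_tokens, embedding_dim)
        return self._encoder(embedded, mask)   # (batch, num_sentences, encoder_output_dim)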

@mojesty (Contributor) commented May 15, 2019

@matt-gardner If the inner TextFields are padded, that's great; I don't understand the padding logic entirely, so I assumed it would break down.
@faizan30 In my code I used a simple CharacterTokenizer and then a custom Seq2Vec encoder to encode each whole line of characters into one vector (in my case, stacked CNNs), followed by a Seq2SeqEncoder (BiLSTM) before the line-wise classifier.
With this setup you can easily switch between character-wise and token-wise models by just changing the tokenizer and using pretrained word embeddings.
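
(Purely for illustration, not mojesty's actual code, the two encoders could be put together roughly like this; the dimensions are invented:)

import torch
from allennlp.modules.seq2vec_encoders import CnnEncoder
from allennlp.modules.seq2seq_encoders import PytorchSeq2SeqWrapper

# Encode each line (a sequence of character embeddings) into a single vector.
line_encoder = CnnEncoder(embedding_dim=50, num_filters=64, ngram_filter_sizes=(2, 3, 4))

# Contextualize the per-line vectors across the whole document with a BiLSTM
# before the line-wise classifier.
document_encoder = PytorchSeq2SeqWrapper(
    torch.nn.LSTM(input_size=line_encoder.get_output_dim(),
                  hidden_size=100, bidirectional=True, batch_first=True))

Switching to a token-wise model then only means changing the tokenizer, the token indexer, and the embeddings; both encoders stay the same.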

@faizan30 (Author) commented May 15, 2019

@mojesty Is each Instance a sequence of lines in your case? If so, how do you process each line separately in the Seq2Vec encoder? I can only think of a for-loop; is there a better way? It would be very helpful if you could share your config file.

@matt-gardner (Member) commented May 15, 2019

@faizan30, we have a TimeDistributed module that wraps around things that are in lists. But I think it'd be better to start with a plain-text description of what you're trying to do. I'm not sure why you have a list of TextFields, and I suspect that you don't need them. Can you explain what your model looks like?

@faizan30 (Author) commented May 16, 2019

@matt-gardner Thanks for the response. I'm trying to classify paragraphs. Each paragraph is long, with many sentences, and each sentence is represented by a TextField. Each paragraph is a list of sentences, hence a ListField of TextFields.

I want to get an encoding for each sentence using a Seq2Vec encoder. Then I want to pass these sentence encodings to another encoder (probably Seq2Seq), followed by a feedforward and projection layer.
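
(A minimal constructor sketch of such a model; the class and attribute names below are illustrative, not taken from the actual paragraph_classifier:)

import torch
from allennlp.data import Vocabulary
from allennlp.models import Model
from allennlp.modules import FeedForward, Seq2SeqEncoder, Seq2VecEncoder, TextFieldEmbedder

class ParagraphClassifier(Model):
    def __init__(self, vocab: Vocabulary,
                 text_field_embedder: TextFieldEmbedder,
                 sentence_encoder: Seq2VecEncoder,    # tokens -> one vector per sentence
                 paragraph_encoder: Seq2SeqEncoder,   # contextualizes the sentence vectors
                 feedforward: FeedForward) -> None:
        super().__init__(vocab)
        self._text_field_embedder = text_field_embedder
        self._sentence_encoder = sentence_encoder
        self._paragraph_encoder = paragraph_encoder
        self._feedforward = feedforward
        self._projection = torch.nn.Linear(feedforward.get_output_dim(),
                                           vocab.get_vocab_size("labels"))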

@matt-gardner (Member) commented May 16, 2019

Ok, yes, then you are setting up the fields correctly. The things you need to be careful about: when you call your TextFieldEmbedder, pass num_wrapping_dims=1, and wrap your Seq2Vec encoder with TimeDistributed(). Both of these happen in your Model.forward() method. Let me know if you need more detail.
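
(One way to follow that advice, assuming the Seq2Vec encoder lives in the model rather than inside the TextFieldEmbedder and the field is named 'lines'; attribute names follow the constructor sketch above and are not from the actual model:)

from typing import Dict
import torch
from allennlp.modules import TimeDistributed
from allennlp.nn import util

def forward(self, lines: Dict[str, torch.LongTensor],
            labels: torch.LongTensor = None) -> Dict[str, torch.Tensor]:
    # Token ids from the ListField of TextFields: (batch, num_sentences, num_tokens).
    # num_wrapping_dims=1 tells the embedder about the extra sentence dimension.
    embedded = self._text_field_embedder(lines, num_wrapping_dims=1)
    # embedded: (batch, num_sentences, num_tokens, embedding_dim)

    token_mask = util.get_text_field_mask(lines, num_wrapping_dims=1)
    # token_mask: (batch, num_sentences, num_tokens)

    # TimeDistributed folds the sentence dimension into the batch, so the
    # Seq2VecEncoder sees ordinary (batch * num_sentences, num_tokens, dim) input.
    sentence_vectors = TimeDistributed(self._sentence_encoder)(embedded, token_mask)
    # sentence_vectors: (batch, num_sentences, sentence_dim)

    # A sentence is real (not padding) if it contains at least one real token.
    sentence_mask = (token_mask.sum(dim=-1) > 0).long()

    encoded = self._paragraph_encoder(sentence_vectors, sentence_mask)
    # encoded: (batch, num_sentences, encoder_dim); hand this to the feedforward
    # and projection layers and compute the loss against `labels`.
    return {"encoded_paragraph": encoded}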

@faizan30 (Author) commented May 16, 2019

@matt-gardner Thank you for the prompt reply. I tried adding num_wrapping_dims=1 and wrapped the Seq2Vec encoder with TimeDistributed. My Seq2Vec encoder is part of the TextFieldEmbedder; is that the correct approach? This gives me an index-out-of-range error.

File "/home/fkhan/botml/botml/botai/models.py", line 151, in forward
  embedded_text_input = self.text_field_embedder(tokens, num_wrapping_dims=1)
File "/home/fkhan/botml/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
  result = self.forward(*input, **kwargs)
File "/home/fkhan/botml/env/lib/python3.7/site-packages/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 123, in forward
  token_vectors = embedder(*tensors)
File "/home/fkhan/botml/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
  result = self.forward(*input, **kwargs)
File "/home/fkhan/botml/env/lib/python3.7/site-packages/allennlp/modules/time_distributed.py", line 51, in forward
  reshaped_outputs = self._module(*reshaped_inputs, **reshaped_kwargs)
File "/home/fkhan/botml/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
  result = self.forward(*input, **kwargs)
File "/home/fkhan/botml/botml/botai/sentence_encoder.py", line 41, in forward
  embedded_text = self._embedding(paragraphs)
File "/home/fkhan/botml/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
  result = self.forward(*input, **kwargs)
File "/home/fkhan/botml/env/lib/python3.7/site-packages/allennlp/modules/token_embedders/embedding.py", line 139, in forward
  sparse=self.sparse)
File "/home/fkhan/botml/env/lib/python3.7/site-packages/torch/nn/functional.py", line 1454, in embedding
  return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:191

@matt-gardner (Member) commented May 16, 2019

This sounds like a vocabulary issue that'll be hard for me to debug remotely, unfortunately. See if you can figure out what the index was, what the size of the embedding it's trying to index into is, what the token indexers are doing, etc.
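
(For example, a quick check along those lines might look like the snippet below; `batch` and the 'lines'/'tokens' names are assumptions based on the config above:)

# If the largest id produced by the indexer is >= the number of rows in the
# Embedding, torch.embedding raises exactly this index-out-of-range error,
# which usually means the embedding was built from a different vocabulary
# than the one used at indexing time.
token_ids = batch["lines"]["tokens"]   # (batch, num_sentences, num_tokens)
print("max token id:", token_ids.max().item())
print("vocab size  :", model.vocab.get_vocab_size("tokens"))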

@faizan30 (Author) commented May 16, 2019

@matt-gardner The indexer is single_id and the embedding dimension is 100. I will debug further and ping if I get more info. Thank you so much for the help.

@mojesty (Contributor) commented May 16, 2019

No need for a for-loop: you just create a batch of lines of tokens and everything works fine.

@faizan30 (Author) commented May 16, 2019

Adding num_wrapping_dims=1 works; it turns out I didn't need to wrap the encoder in TimeDistributed. Thank you @matt-gardner @mojesty for the help. I really appreciate it :)
