Efficiently generate summaries from new data? #51
Comments
Tips for anyone running decoding with a pretrained model:
Here, "vocab" is from the "finished" CNN/Daily Mail data and "pretrained_model" is, e.g., one of the pretrained models kindly made available by the author. The summarization results will then be located in a "decoded" folder inside a new directory of the form "decode_test_500maxenc...".
@ibarrien What changes did you make in make_datafiles.py? In the cnn-dailymail code, input is fed in the form of URLs; how can I give only the article text as input and generate data for new articles? Could you please shed some light?
I am having a similar issue. I am able to successfully run the pretrained TextSum model (TensorFlow 1.2.1). The output consists of summaries of CNN & Daily Mail articles (which are chunked into bin format prior to testing). I have also been able to create the aforementioned bin-format test data for CNN/Daily Mail articles and the vocab file (per the instructions here). However, I am not able to create my own test data to check how good the summary is. I have tried modifying the make_datafiles.py code to remove hard-coded values. I am able to create tokenized files, but the next step seems to be failing. It'll be great if someone can help me understand this step: "For each of the url lists all_train.txt, all_val.txt and all_test.txt, the corresponding tokenized stories are read from file, lowercased and written to serialized binary files train.bin, val.bin and test.bin. These will be placed in the newly-created finished_files directory." How is a URL such as http://web.archive.org/web/20150401100102id_/http://www.cnn.com/2015/04/01/europe/france-germanwings-plane-crash-main/ being mapped to the corresponding story in my data folder? If someone has had success with this, please let me know how to go about it. Thanks in advance!
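For reference, make_datafiles.py maps each URL to its story file by hashing the URL with SHA-1 (its hashhex function) and looking for <hash>.story in the tokenized stories directory. A minimal sketch of that mapping (Python 3 shown; the original script is Python 2, where the .encode() call is not needed):

```python
import hashlib

def hashhex(s):
    """Return the hexadecimal SHA-1 hash of a string, as in make_datafiles.py."""
    h = hashlib.sha1()
    h.update(s.encode("utf-8"))  # .encode() is only needed on Python 3
    return h.hexdigest()

url = ("http://web.archive.org/web/20150401100102id_/"
       "http://www.cnn.com/2015/04/01/europe/france-germanwings-plane-crash-main/")
# The tokenized story for this URL is expected at <stories_dir>/<hash>.story
print(hashhex(url) + ".story")
```

So to use your own data, you can either name your story files with such hashes, or strip the hash lookup out of the script.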
If you are trying to test the model on your own data, you can avoid dealing with the …
@pruthvishetty When I ran the …
@microLizzy Make this change in your script: |
@pruthvishetty That works! Thanks so much! |
@pruthvishetty did you manage to disable ROUGE and still run single_pass? Which lines did you comment out? Thanks!
@jonaseltes There are just a couple of lines for the ROUGE functions in the decode.py file. Just find where those functions are used and comment them out; it's an easy task.
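For anyone hunting for the exact spot, this is roughly what to comment out (a sketch based on my reading of decode.py; the variable names may differ slightly in your copy of the repo):

```python
# In BeamSearchDecoder.decode() in decode.py, single_pass mode ends by
# running pyrouge. Commenting out these two calls skips ROUGE entirely,
# while the decoded summaries are still written to disk:

# results_dict = rouge_eval(self._rouge_ref_dir, self._rouge_dec_dir)
# rouge_log(results_dict, self._decode_dir)
```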
How do folks typically generate summaries given i) new text files (only content, no headlines/abstracts/urls) and ii) the pretrained model?
It seems that make_datafiles.py from the cnn-dailymail repo needs to be modified (e.g., removing much of the hard-coding). After these modifications, make_datafiles.py can be used to tokenize and chunk the new data (a sketch of the serialization step is below). From there, we "decode" using the pretrained model to generate new summaries.
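For concreteness, here is a minimal sketch of that serialization step, adapted from write_to_bin in make_datafiles.py (Python 3 shown; the article text is a made-up example, and the placeholder abstract is there only because the data pipeline expects an "abstract" feature even when you have no reference summary):

```python
import struct
from tensorflow.core.example import example_pb2

# Assumes the article has already been tokenized (make_datafiles.py uses
# Stanford CoreNLP's PTBTokenizer) and lowercased.
article = b"germanwings flight 9525 crashed in the french alps ..."
# The <s>...</s> tags mirror SENTENCE_START/SENTENCE_END in make_datafiles.py.
abstract = b"<s> placeholder summary . </s>"

with open("test.bin", "wb") as writer:
    tf_example = example_pb2.Example()
    tf_example.features.feature["article"].bytes_list.value.extend([article])
    tf_example.features.feature["abstract"].bytes_list.value.extend([abstract])
    tf_example_str = tf_example.SerializeToString()
    str_len = len(tf_example_str)
    # Each example is length-prefixed: an 8-byte length, then the bytes.
    writer.write(struct.pack("q", str_len))
    writer.write(struct.pack("%ds" % str_len, tf_example_str))
```

If I remember correctly, --data_path can point at a single file like this directly; chunking into test_000.bin etc. is only needed for large datasets.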
Is the above generally correct or is there a more efficient method?