Very long texts #13

Open
filannim opened this issue Nov 7, 2014 · 12 comments


@filannim

filannim commented Nov 7, 2014

I am trying to parse a text which is 1297 characters long, but it returns an empty sentence. If I use a different timeout value in the file client.py, let's say 200.0, then after that time passes the code raises a jsonrpc.RPCTransportError: timed out exception.

Could you tell me what I am supposed to modify in the code to make client.py work with longer texts?

Thanks,
michele.

@dasmith
Owner

dasmith commented Nov 7, 2014

Hi Michele - You're on the right track, but you also have to change the default timeout in the JSON Client library, jsonrpc.py, because it is set to 5 seconds by default.
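The timeout discussed here is, at bottom, a socket timeout. As a minimal sketch of that behaviour, the snippet below uses only Python's standard `socket` module rather than the bundled jsonrpc.py, whose internals are not quoted in this thread. It shows that a read from a peer that never answers gives up once the configured interval elapses. The helper name `read_with_timeout` and the 0.2-second value are illustrative assumptions.

```python
import socket


def read_with_timeout(timeout_seconds):
    """Try to read from a peer that never sends anything.

    Returns the string "timed out" if the read gives up after
    timeout_seconds, otherwise the bytes that were received.
    """
    # socketpair() returns two connected sockets in one process; they stand
    # in for the client and a server that is still busy parsing.
    client, server = socket.socketpair()
    try:
        client.settimeout(timeout_seconds)
        try:
            return client.recv(4096)
        except socket.timeout:
            # The peer did not answer within the configured interval.
            return "timed out"
    finally:
        client.close()
        server.close()


print(read_with_timeout(0.2))
```

If the server needs longer than the client's timeout to produce a parse, the client sees this kind of error no matter how large the request was.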

@filannim
Author

By modifying client.py as follows:

nlp = ServerProxy(JsonRpc20(), TransportTcpIp(addr=("127.0.0.1", 8080), timeout=200.0))
text = 'very very long text'
result = json.loads(nlp.parse(text))

I am also changing the timeout variable (in TransportTcpIp class) from 5.0 (default value) to 200.00. Is that variable the one you were talking about?

If I run this code on the text described before, I get the result {u'error': u'timed out after 40.000000 seconds'}.

Is there something else I should change?

Thanks a lot,
michele.

@uday-sherlock

I followed the steps mentioned by michele and encountered a 'jsonrpc.RPCTransportError: [Errno 104] Connection reset by peer'. My file is approximately 40,000 characters long, and running Stanford CoreNLP on it takes about 300 seconds. When I first encountered the timed out exception, I followed your advice and set my timeout to 1800.
What else would you suggest?

Thank You
Uday

@devisreevvtl

This can be solved by increasing the 'limit' and 'timeout' parameters in jsonrpc.py.
That is, for a long text (e.g. 40000 characters), set limit to 40000 and timeout to 1800.
Thanks...
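If `limit` caps how many bytes a single socket read may return (an assumption, since jsonrpc.py's internals are not quoted in this thread), then a message longer than the cap only arrives in full when the reader keeps reading. The sketch below illustrates that general socket behaviour with the standard library only. The helper `recv_all` and the 3000-byte payload are hypothetical.

```python
import socket


def recv_all(sock, expected_len, limit=1024):
    """Read expected_len bytes from sock, at most `limit` bytes per call."""
    chunks = []
    received = 0
    while received < expected_len:
        chunk = sock.recv(min(limit, expected_len - received))
        if not chunk:
            break  # the peer closed the connection early
        chunks.append(chunk)
        received += len(chunk)
    return b"".join(chunks)


sender, receiver = socket.socketpair()
payload = b"x" * 3000           # arbitrary message longer than the cap
sender.sendall(payload)

first = receiver.recv(1024)     # a single capped read returns only a part
rest = recv_all(receiver, len(payload) - len(first))
sender.close()
receiver.close()

print(len(first), len(first) + len(rest))
```

A single `recv(1024)` never returns more than 1024 bytes, so raising the cap or looping until the whole message has arrived are the two usual ways to avoid truncation at this layer.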

@filannim
Author

The problem is still there. This is what I did.

In corenlp.py (server)

I modified lines #256 and #257, by adding the limit and timeout parameters:

server = jsonrpc.Server(jsonrpc.JsonRpc20(),
                        jsonrpc.TransportTcpIp(addr=(options.host, int(options.port)), limit=50000, timeout=2000.0))

I run the server on the default address 127.0.0.1:8080 and wait until it loads the 5 modules.

In my client.py (client)

I wrote a simple client.py (very similar to your client.py) but I added here the same limit and timeout parameters.

import json
from jsonrpc import ServerProxy, JsonRpc20, TransportTcpIp
from pprint import pprint

nlp = ServerProxy(JsonRpc20(), TransportTcpIp(addr=("127.0.0.1", 8080), limit=50000, timeout=2000.0))
doc = "\\n\\n A New York man who was accused of faking his death last summer pleaded guilty to a conspiracy charge Thursday, Nassau County District Attorney Kathleen Rice announced.\\n\\nRaymond Roth, 48, of Massapequa, New York, was first reported missing in the waters off Jones Beach late last July by his 22-year-old son, Jonathan Roth. Several days into an extensive search involving multiple agencies, New York State Park Police said, authorities learned the missing man was in South Carolina, where he had been pulled over for speeding.\\n\\nThe day before Raymond Roth was pulled over, his wife, Evana, showed authorities e-mails she had discovered that appeared to detail a plan between him and his son to fake his death. Raymond Roth wanted his wife and son to collect at least $410,000 in life insurance benefits while he started a new life in Florida, Rice said.\\n\\nState police arrested both men in early August on charges of insurance fraud, conspiracy and filing a false report. Raymond Roth on Thursday agreed to plead guilty to the conspiracy charge in exchange for a sentence of 90 days in jail and five years\\' probation, the district attorney\\'s office said. He also must pay restitution for the cost of the search -- $27,445 to the U.S. Coast Guard and $9,109 to the Nassau County Police Department.\\n\\nEvana Roth told CNN in August she thought her husband devised the plan after he was fired from his job in July. Her attorney, Lenard Leeds, said she had been unaware of the ruse before she uncovered the e-mail correspondence.\\n\\n\"There needs to be a way for me to find out how things are going. Call me Sunday night at 8 PM at the resort,\" Raymond Roth wrote in an e-mail to his son the day before the son reported him missing.\\n\\nThe son\\'s case is still pending, the district attorney said. 
Jonathan Roth\\'s attorney, Joey Jackson, defended his client after his arrest, saying, \"There was abuse here, manipulation here, coercion here\" from the father.\\n\\nRaymond Roth\\'s attorney, Brian Davis, denied in August that Roth had involved his son in the scheme.\\n\\n\"We had issues concerning the facts people had whether (Roth) had an agreement with his son,\" Davis told CNN on Thursday. \"He\\'s admitted it now. He\\'s accepted responsibility.\"\\n\\nDavis added that his client has been under treatment for bipolar disorder in recent weeks.\\n\\nDuring plea negotiations, Raymond Roth asked the district attorney\\'s office not to give his son jail time, Davis said.\\n\\nOn the advice of both their attorneys, father and son have not been in contact since their arrests, Davis said.\\n\\n\"He would like to straighten things out with (Jonathan) when the time comes,\" he said.\\n\\n\\n\\n\\n"
print 'Doc lenght: {}'.format(len(doc))
pprint(json.loads(nlp.parse(doc)))

Result

This is what I get:

Doc lenght: 2682
{u'error': u'timed out after 137.100000 seconds'}

Notice that the limit parameter is set well beyond the actual size of that document (2682). Also, if I print the characters received by the server, I get only the first 1016 chars of the original text:

INFO:__main__:Serving on http://127.0.0.1:8080
ERROR:__main__:Error: Timeout with input '\n\n A New York man who was accused of faking his death last summer pleaded guilty to a conspiracy charge Thursday, Nassau County District Attorney Kathleen Rice announced.\n\nRaymond Roth, 48, of Massapequa, New York, was first reported missing in the waters off Jones Beach late last July by his 22-year-old son, Jonathan Roth. Several days into an extensive search involving multiple agencies, New York State Park Police said, authorities learned the missing man was in South Carolina, where he had been pulled over for speeding.\n\nThe day before Raymond Roth was pulled over, his wife, Evana, showed authorities e-mails she had discovered that appeared to detail a plan between him and his son to fake his death. Raymond Roth wanted his wife and son to collect at least $410,000 in life insurance benefits while he started a new life in Florida, Rice said.\n\nState police arrested both men in early August on charges of insurance fraud, conspiracy and filing a false report. Raymond Roth on Thursday agreed to plead gu'

I think the problem is somewhere in the StanfordCoreNLP._parse method in corenlp.py.

PS. Apologies for the late reply (for some reason I didn't receive any notification via e-mail).
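The two server-side error messages reported so far (40.0 seconds for the 1297-character text, 137.1 seconds for the 2682-character one) are both consistent with a timeout derived from the length of the text. The sketch below shows one formula that reproduces both numbers. The formula, the cap, and the name `guessed_timeout` are guesses fitted to those two data points, not code taken from corenlp.py.

```python
def guessed_timeout(text_length, cap=None):
    """Hypothetical server-side timeout: 3 s plus 1 s per 20 characters."""
    seconds = 3 + text_length / 20.0
    return seconds if cap is None else min(cap, seconds)


# 1297 characters with a 40-second cap, as in the first report.
print("%f" % guessed_timeout(1297, cap=40))   # 40.000000
# 2682 characters without the cap, as in the second report.
print("%f" % guessed_timeout(2682))           # 137.100000
```

If something like this is in use on the server side, it would explain why raising the jsonrpc timeout alone does not change the error: the server gives up waiting for CoreNLP on its own schedule.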

@abhigenie92

Any solution to this?

@alvations

Confirmed that changing default timeout at https://github.com/dasmith/stanford-corenlp-python/blob/master/jsonrpc.py#L746 to something like 200 seconds works for pathologically long sentences.

@hyuglim

hyuglim commented Mar 6, 2016

In jsonrpc.py, search for the string "5.0" and change every occurrence to a larger number; that solved the problem for me.
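To see every place where such a literal appears before editing it, a small helper along these lines may be handy. It is only a sketch: `find_literal` is a hypothetical name, and the sample text is made up for illustration rather than copied from jsonrpc.py.

```python
import re


def find_literal(source_text, literal="5.0"):
    """Return (line_number, line) pairs where `literal` appears as a whole number."""
    # The lookarounds avoid matching inside longer numbers such as 15.0 or 5.05.
    pattern = re.compile(r"(?<![\w.])" + re.escape(literal) + r"(?![\w.])")
    return [
        (number, line.strip())
        for number, line in enumerate(source_text.splitlines(), start=1)
        if pattern.search(line)
    ]


# Hypothetical sample standing in for the real file contents.
sample = """\
DEFAULT_TIMEOUT = 5.0
RETRY_DELAY = 15.0
sock.settimeout(5.0)
"""

for number, line in find_literal(sample):
    print(number, line)
```

For the real file, the text could be read with `open("jsonrpc.py").read()` and passed to the same function.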

@akornilo

This is a late follow-up, but this suggested fix does not work for me. I've set the timeouts to about 300s, but it appears that corenlp still times out. Did anybody have a different solution?

In particular, if I try to parse a chunk of 1000 chars it goes through fine, but trying to parse 1024 chars breaks.

Edit: It appears that the issue comes from CoreNLP itself. Its command line interface uses a buffer which only reads in 1024 chars at a time, and since this library essentially drives CoreNLP through the shell, it breaks on longer strings.
As far as I can tell, there is no way to fix this without using a custom copy of Stanford's CoreNLP.
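One workaround that follows from this observation is to keep every request under the reported limit by splitting the text into smaller pieces and parsing each piece separately. The sketch below is one way to do that, assuming the limit really is about 1024 characters and that parsing groups of sentences independently is acceptable. `split_into_chunks` is a hypothetical helper, and its sentence splitting is deliberately naive.

```python
import re


def split_into_chunks(text, max_chars=1000):
    """Group sentences into chunks of at most max_chars characters."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that is itself too long.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = sentence if not current else current + " " + sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


doc = "A sentence. " * 300
chunks = split_into_chunks(doc)
print(len(chunks), max(len(chunk) for chunk in chunks))
```

Each chunk could then be sent with `nlp.parse(chunk)` as in the client code earlier in this thread. The per-chunk results would have to be merged by the caller, and anything that spans chunk boundaries (coreference, for example) would be lost.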

@darthbhyrava

@dasmith Could you please look into this?
There is a corresponding set of stackoverflow questions, too, but they aren't working for me.
https://stackoverflow.com/questions/32550162/
https://stackoverflow.com/questions/41260313/

It would be great if you could respond to this, thanks.

@dasmith
Owner

dasmith commented May 31, 2019 via email

Dustin passed away Feb. 2015

On Wed, Feb 21, 2018 at 8:32 AM Sriharsh Bhyravajjula wrote:
@dasmith Could you please look into this?
There is a corresponding set of stackoverflow questions, too, but they aren't working for me.
https://stackoverflow.com/questions/32550162/
https://stackoverflow.com/questions/41260313/
It would be great if you could respond to this, thanks.

@darthbhyrava

I was checking my notifications after a long time, and I just saw this.

I have been thinking of a reply for the past half an hour, I don't know what to say. I did not expect to find something so deafeningly loud when I clicked on that bell icon. I don't even know how I feel overwhelmed by those five words, despite being sensitive, I am just a random stranger on the internet who happened to use Dustin's code.

But for the last half an hour, I've been sitting in silence. I've tried to read about him; I found his obituary, online, and I imagine a very smart and kind man, someone who might have chuckled a bit at an NLP question I might have asked at the end of a class.

'Beloved son, brother, uncle and friend. Known for his incredible mind and infectious sense of humor, Dustin leaves a lasting legacy of compassion and tenderness.', the words go. His friends said that 'Dustin’s bright blue eyes could light up any room'.

And in one of his pictures Dustin was glad to share on his MIT page, I can see the warm smile which they were talking about, a smile which once perhaps spontaneously broke out in the middle of a gathering and spread around like a warm breeze on a spring morning.

The world is a lesser place for your loss, Dustin.
May you rest in peace.
