unshorten.py produces invalid json #54

Closed
ruebot opened this issue Feb 3, 2015 · 11 comments

Comments

@ruebot
Member

ruebot commented Feb 3, 2015

I confirmed a collection's validity with validate.py prior to running unshorten.py. After running unshorten.py on the collection and checking its validity again with validate.py, I get lots of errors.

Sample:

uhoh, we got a problem on line: 9376769
No JSON object could be decoded
uhoh, we got a problem on line: 9413314
No JSON object could be decoded
uhoh, we got a problem on line: 9457029
No JSON object could be decoded
uhoh, we got a problem on line: 9470191
No JSON object could be decoded
uhoh, we got a problem on line: 9474397
No JSON object could be decoded
uhoh, we got a problem on line: 9500591
No JSON object could be decoded
uhoh, we got a problem on line: 9506738
No JSON object could be decoded
uhoh, we got a problem on line: 9517267
No JSON object could be decoded
uhoh, we got a problem on line: 9545542
No JSON object could be decoded
uhoh, we got a problem on line: 9567288
No JSON object could be decoded
uhoh, we got a problem on line: 9632298
No JSON object could be decoded
uhoh, we got a problem on line: 9676049
No JSON object could be decoded
uhoh, we got a problem on line: 9689651
No JSON object could be decoded
uhoh, we got a problem on line: 9761634
No JSON object could be decoded
uhoh, we got a problem on line: 9773360
No JSON object could be decoded
uhoh, we got a problem on line: 9943500
No JSON object could be decoded
uhoh, we got a problem on line: 9967734
No JSON object could be decoded
uhoh, we got a problem on line: 10024047
No JSON object could be decoded
uhoh, we got a problem on line: 10063945
No JSON object could be decoded

The invalid JSON then prevents me from getting a list of the top URLs in a collection with urls.py, because urls.py stops at the first line it cannot parse.

$ cat JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-unshortened-urls-20150129.json | ~/git/twarc/utils/urls.py | sort | uniq -c | sort -n > JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-urls-20150129.txt
Traceback (most recent call last):
  File "/home/nruest/git/twarc/utils/urls.py", line 11, in <module>
    tweet = json.loads(line)
  File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded

Is it trivial to have unshorten.py write back valid json?

@edsu
Member

edsu commented Feb 3, 2015

I think I've noticed unshorten writing empty lines sometimes. Is that what your invalid lines look like?

@ruebot
Member Author

ruebot commented Feb 4, 2015

Here is one (sorry it took a while, I couldn't get awk or sed to print out a specific line number from this file):

$ head -9766769 all-tweets-unshortened-urls-20150129.json | tail -1

{"contributors": null, "truncated": false, "text": "RT @RaphBotts: Sur la place d'arme il y a un monde c'est du jamais vu \ud83d\ude31 #JesuisCharlie", "in_reply_to_status_id": null, "id": 552955311666253825, "favorite_count": 0, "source": "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>", "retweeted": false, "coordinates": null, "entities": {"symbols": [], "user_mentions": [{"indices": [3, 13], "id_str": "798216505", "screen_name": "RaphBotts", "name": "Raph.", "id": 798216505}], "hashtags": [{"indices": [72, 86], "text": "JesuisCharlie"}], "urls": []}, "in_reply_to_screen_name": null, "in_reply_to_user_id": null, "retweet_count": 9, "id_str": "552955311666253825", "favorited": false, "retweeted_status": {"contributors": null, "truncated": false, "text": "Sur la place d'arme il y a un monde c'est du jamais vu \ud83d\ude31 #JesuisCharlie", "in_reply_to_status_id": null, "id": 552914408524226560, "favorite_count": 5, "source": "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>", "retweeted": false, "coordinates": null, "entities": {"symbols": [], "user_mentions": [], "hashtags": [{"indices": [57, 71], "text": "JesuisCharlie"}], "urls": []}, "in_reply_to_screen_name": null, "in_reply_to_user_id": null, "retweet_count": 9, "id_str": "552914408524226560", "favorited": false, "user": {"follow_request_sent": false, "profile_use_background_image": true, "profile_text_color": "333333", "id": 798216505, "verified": false, "profile_location": null, "profile_image_url_https": "https://pbs.twimg.com/profile_images/551383654770159616/9ljxadjS_normal.jpeg", "profile_sidebar_fill_color": "DDEEF6", "contributors_enabled": false, "entities": {"url": {"urls": [{"url": "http://t.co/YRVxCwwRdm", "indices": [0, 22], "expanded_url": "http://www.facebook.com/raphael.bottreau", "display_url": "facebook.com/raphael.bottre\u2026"}]}, "description": {"urls": []}}, "followers_count": 406, 
"profile_sidebar_border_color": "000000", "location": "Poitiers city / \u00cele de R\u00e9", "default_profile_image": false, "id_str": "798216505", "is_translation_enabled": false, "utc_offset": 3600, "statuses_count": 11858, "description": "Kim. La Clic. La Famille.", "friends_count": 308, "profile_link_color": "0084B4", "profile_image_url": "http://pbs.twimg.com/profile_images/551383654770159616/9ljxadjS_normal.jpeg", "notifications": false, "geo_enabled": false, "profile_background_color": "C0DEED", "profile_banner_url": "https://pbs.twimg.com/profile_banners/798216505/1419937479", "profile_background_image_url": "http://pbs.twimg.com/profile_background_images/510775501359570944/tCPBY6MM.jpeg", "name": "Raph.", "lang": "fr", "following": false, "profile_background_tile": true, "favourites_count": 855, "screen_name": "RaphBotts", "url": "http://t.co/YRVxCwwRdm", "created_at": "Sun Sep 02 12:56:22 +0000 2012", "profile_background_image_url_https": "https://pbs.twimg.com/profile_background_images/510775501359570944/tCPBY6MM.jpeg", "time_zone": "Paris", "protected": false, "default_profile": false, "is_translator": false, "listed_count": 1}, "geo": null, "in_reply_to_user_id_str": null, "lang": "fr", "created_at": "Wed Jan 07 19:47:22 +0000 2015", "in_reply_to_status_id_str": null, "place": null, "metadata": {"iso_language_code": "fr", "result_type": "recent"}}, "user": {"follow_request_sent": false, "profile_use_background_image": true, "profile_text_color": "618238", "id": 280417651, "verified": false, "profile_location": null, "profile_image_url_https": "https://pbs.twimg.com/profile_images/551495029530066944/MqxgVTtE_normal.jpeg", "profile_sidebar_fill_color": "060A00", "contributors_enabled": false, "entities": {"url": {"urls": [{"url": "http://t.co/3VPPJFFK5v", "indices": [0, 22], "expanded_url": "http://instagram.com/ludivinee_", "display_url": "instagram.com/ludivinee_"}]}, "description": {"urls": []}}, "followers_count": 222, "profile_sidebar_border_color": 
"FFFFFF", "location": "France", "default_profile_image": false, "id_str": "280417651", "is_translation_enabled": false, "utc_offset": 3600, "statuses_count": 6609, "description": ".", "friends_count": 212, "profile_link_color": "E6A84A", "profile_image_url": "http://pbs.twimg.com/profile_images/551495029530066944/MqxgVTtE_normal.jpeg", "notifications": false, "geo_enabled": false, "profile_background_color": "000000", "profile_banner_url": "https://pbs.twimg.com/profile_banners/280417651/1412534415", "profile_background_image_url": "http://pbs.twimg.com/profile_background_images/378800000086764942/264e232fb90bd5297dfb2d9ff5b4db62.jpeg", "name": "Ukuthula \u270f", "lang": "fr", "following": false, "profile_background_tile": true, "favourites_count": 1146, "screen_name": "_Ludivinee", "url": "http://t.co/3VPPJFFK5v", "created_at": "Mon Apr 11 08:46:38 +0000 2011", "profile_background_image_url_https": "https://pbs.twimg.com/profile_background_images/378800000086764942/264e232fb90bd5297dfb2d9ff5b4db62.jpeg", "time_zone": "Paris", "protected": false, "default_profile": false, "is_translator": false, "listed_count": 1}, "geo": null, "in_reply_to_user_id_str": null, "lang": "fr", "created_at": "Wed Jan 07 22:29:54 +0000 2015", "in_reply_to_status_id_str": null, "place": null, "metadata": {"iso_language_code": "fr", "result_type": "recent"}}

I can edit this comment and add more examples if need be.

@edsu
Member

edsu commented Feb 4, 2015

I'm confused; that's valid JSON :-) Perhaps just update validate.py to output the JSON it failed to parse in the log message?
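For illustration, here is a minimal sketch of that idea. It is not the actual validate.py; `find_invalid_lines` is a hypothetical helper name, and it assumes one JSON document per line. It reports the raw content of each line that fails to parse, alongside the line number:

```python
import json


def find_invalid_lines(lines):
    """Yield (line_number, raw_line, error_message) for each line that is not valid JSON."""
    for number, line in enumerate(lines, start=1):
        try:
            json.loads(line)
        except ValueError as error:
            # On Python 3, json.JSONDecodeError is a subclass of ValueError,
            # so this also works there.
            yield number, line, str(error)


# Small in-memory demo: the second "line" is blank.
sample = ['{"id": 1}\n', '\n', '{"id": 2}\n']
for number, line, message in find_invalid_lines(sample):
    # repr() makes an otherwise invisible blank line show up as '\n'.
    print("problem on line %d, content: %r" % (number, line))
# prints: problem on line 2, content: '\n'
```

Printing the line with `repr()` matters here, since an empty line would otherwise look like no output at all.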

@ruebot
Member Author

ruebot commented Feb 4, 2015

...it is! I wonder if it printed the wrong line.

When I try to print one of the line numbers from the validate.py sample output above with awk, sed, or grep, I just get a blank line back. I wonder if that has something to do with it?

...and it looks like head and tail just did the same.

[nruest@gorille:all]$ head -9413314 all-tweets-unshortened-urls-20150129.json | tail -1

[nruest@gorille:all]$                                                                  
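A blank result from `head | tail` is hard to tell apart from a tool misbehaving. As a rough sketch (the helper name `nth_line` is made up for this example), the same lookup can be done in Python and shown with `repr()`, so an empty line is visible as `'\n'`:

```python
from itertools import islice


def nth_line(lines, n):
    """Return the n-th line (1-indexed) from an iterable of lines, or None if there are fewer than n lines."""
    return next(islice(lines, n - 1, n), None)


# In-memory demo; with a real file you would pass the open file handle instead.
sample = ['{"id": 1}\n', '\n', '{"id": 2}\n']
print(repr(nth_line(sample, 2)))
# prints: '\n'
```

Because `islice` reads lazily, passing an open file handle only reads up to line `n` instead of loading the whole file into memory.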

@edsu
Member

edsu commented Feb 4, 2015

Yes, I think that the invalid json line is simply a blank line. That's what I've seen in the past with unshorten.py's output. Thanks!
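Until the output is regenerated, a consumer such as urls.py could tolerate blank lines while reading. The following is only a sketch under that assumption. It is not the fix referenced by the commits in this thread (their contents are not shown here), and `parse_json_lines` is a hypothetical name:

```python
import json


def parse_json_lines(lines):
    """Yield one parsed object per non-blank line, skipping empty or whitespace-only lines."""
    for line in lines:
        if not line.strip():
            continue  # a blank line is not a JSON document, so skip it
        yield json.loads(line)


# In-memory demo with blank lines mixed in between two tweets.
sample = ['{"id": 1}\n', '\n', '   \n', '{"id": 2}\n']
for tweet in parse_json_lines(sample):
    print(tweet["id"])
```

Lines that are non-blank but still malformed will raise as before, which is probably what you want: it keeps real corruption from being silently dropped.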

@ruebot
Member Author

ruebot commented Feb 4, 2015

Well, now I feel silly for thinking all these *nix tools weren't behaving correctly for over a day 😄

@edsu edsu closed this as completed in 23f7cc9 Feb 4, 2015
@edsu
Member

edsu commented Feb 4, 2015

When you get a chance, could you try running the data through unshorten.py again and see if it is improved?

@ruebot
Member Author

ruebot commented Feb 4, 2015

I'll fire it up now!

@edsu
Member

edsu commented Feb 4, 2015

Reopening; I'm not sure my fix worked. I will verify on a small dataset :-)

@edsu edsu reopened this Feb 4, 2015
@ruebot
Member Author

ruebot commented Feb 4, 2015

Cool. I'll wait until you verify.

@edsu edsu closed this as completed in cd55076 Feb 4, 2015
@edsu
Member

edsu commented Feb 4, 2015

Ok, now give it a try :-)
