Converted pypi to use the simple HTML and JSON APIs #95

Merged
jdorweiler merged 7 commits into duckduckgo:master from yanirs/pypi-data-extraction-fixes on Sep 15, 2014

3 participants
yanirs commented Aug 15, 2014

Following the discussion in PR #93, this change includes:

  • Fetching all the PyPI package information using the simple HTML and JSON APIs.
  • Reimplementation of the parser in Python, as the old Ruby parser was partly broken and relied on a deprecated HTML parsing library (hpricot). The new parser is completely different, as it doesn't need to do any HTML parsing, so it only relies on the Python standard library.
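
As a rough illustration of the two points above, here is a minimal sketch that uses only the Python standard library. It is not the code from this PR. The JSON URL template, the anchor-matching regular expression, and the function names are assumptions made for the example; the `info`, `name` and `summary` keys follow the PyPI JSON layout.

```python
import json
import re

# Assumed URL layout for the per-package JSON document.
JSON_URL_TEMPLATE = "https://pypi.python.org/pypi/{name}/json"

# Matches the text of each anchor in the simple index page, so no
# HTML parsing library is needed (an assumption about the page layout).
_ANCHOR_TEXT = re.compile(r"<a\s[^>]*>([^<]+)</a>", re.IGNORECASE)


def package_names(simple_index_html):
    """Return the package names listed in a simple-index HTML page."""
    return [name.strip() for name in _ANCHOR_TEXT.findall(simple_index_html)]


def json_url(name):
    """Build the (assumed) JSON API URL for one package."""
    return JSON_URL_TEMPLATE.format(name=name)


def summary_line(package_json_text):
    """Extract the package name and one-line summary from a JSON document."""
    info = json.loads(package_json_text)["info"]
    return info["name"], (info.get("summary") or "").strip()
```

The simple index only supplies the list of names; everything shown to the user would then come from the per-package JSON, which is why no HTML parser is required.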

@jdorweiler, I haven't included the removal of special characters from PR #94 because I think it'd break some descriptions. Examples: https://pypi.python.org/pypi/02exercicio/1.0.0 and https://pypi.python.org/pypi/genius/3.1.4. Doesn't encoding the output file in UTF-8 address the issue fixed by PR #94? Is there a way for me to test it locally?

Also, as noted in README.md and fetch.sh, is there a better way of speeding up the download than the solution I used? Downloading the JSONs sequentially is painfully slow.

Finally, I wasn't sure whether it's worth including the long package descriptions in the abstract, because according to https://duck.co/duckduckhack/fathead_overview it should be a single sentence. It's now easy to add the description and other fields if people think it'd be useful. Personally, I think that things like the number of downloads and last release date would also be useful (potentially more than the long description).

jdorweiler commented Aug 15, 2014

@yanirs Thanks for doing this. I'll take a look at which package descriptions were causing the problems and get back to you. There's no way for you to test this locally, but if you want to include multiple lines you can just separate them with a <br>.

jdorweiler commented Aug 15, 2014

@yanirs You weren't kidding about it being slow. It works though.
[Screenshot: selection_161]

Not sure how to get around that, but maybe running wget in quiet mode will speed it up a little. I think it would be good to have the downloads, release date, and development status on there too. What do you think?

yanirs commented Aug 16, 2014

@jdorweiler, that does look nice but it's the old version :)
The new one should have the official homepage link and not have the "package description" prefix (it felt a bit redundant).

Anyway, the new commits include a better fetching script. I think the main problems with the wget solution were a lack of session and connection sharing and hitting the disk too much. With the new solution connections and sessions are shared across requests and all the JSONs get written to a single file. Locally, it took me about 10 minutes to download everything.
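
The general shape of that approach (concurrent fetching with every JSON document appended to one output file) can be sketched as follows. This is not the script from the PR: it swaps in a standard-library thread pool instead of gevent, and `fetch_json` is a hypothetical callable supplied by the caller. In a real run it would presumably wrap a shared HTTP session so that connections are reused.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def fetch_all(names, fetch_json, out_file, workers=20):
    """Fetch package JSON concurrently and write one document per line.

    fetch_json is a hypothetical callable mapping a package name to a
    dict, or to None when the fetch fails. All writes go to the single
    open file object out_file, from the calling thread only.
    Returns the number of documents written.
    """
    written = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map runs the fetches concurrently and yields results
        # in the same order as names.
        for doc in pool.map(fetch_json, names):
            if doc is None:
                continue  # skip packages that could not be fetched
            out_file.write(json.dumps(doc) + "\n")
            written += 1
    return written
```

Writing one JSON document per line keeps disk access to a single sequential stream, and the parser can later read the file back line by line.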

I also added the number of downloads, release date and development status, as you suggested. Together with the summary line, I think it gives users enough information to decide whether it's worth looking into the package.
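
For illustration only, an abstract combining those fields could be assembled along these lines. The field names (`downloads.last_month`, `classifiers`, `urls[].upload_time`) are assumptions about the PyPI JSON layout, and the label wording and function names are hypothetical. Lines are joined with `<br>`, as suggested earlier in this thread.

```python
DEV_STATUS_PREFIX = "Development Status :: "


def development_status(classifiers):
    """Return the development status taken from the trove classifiers, if any."""
    for classifier in classifiers:
        if classifier.startswith(DEV_STATUS_PREFIX):
            return classifier[len(DEV_STATUS_PREFIX):]
    return None


def latest_release_date(urls):
    """Return the date part (YYYY-MM-DD) of the most recent upload time.

    Assumes upload_time values are ISO 8601 strings in one format, so
    comparing them as strings orders them chronologically.
    """
    times = [u["upload_time"] for u in urls if u.get("upload_time")]
    return max(times)[:10] if times else None


def build_abstract(doc):
    """Join the summary and the extra fields into one <br>-separated abstract."""
    info = doc["info"]
    parts = [(info.get("summary") or "").strip()]

    downloads = (info.get("downloads") or {}).get("last_month")
    if downloads is not None:
        parts.append("Downloads in the last month: %d" % downloads)

    released = latest_release_date(doc.get("urls") or [])
    if released:
        parts.append("Latest release date: %s" % released)

    status = development_status(info.get("classifiers") or [])
    if status:
        parts.append("Development status: %s" % status)

    # Drop empty parts so a missing summary does not leave a stray <br>.
    return "<br>".join(p for p in parts if p)
```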

By the way, is it possible to change the sub-heading from "Python" to "Python package"? I think it'd be nice to see the difference between built-in modules and third-party packages. For example, https://duckduckgo.com/?q=python+re and https://duckduckgo.com/?q=python+scipy look very similar.

jdorweiler commented Aug 16, 2014

@yanirs Sorry, here's the newest version.
[Screenshot: selection_163]

I'm going to look into why the release/official links don't show. The output.txt file looks good so I think it's something on my end.
I like the new fetch script. Could you also add a line to the readme that says python-dev is required for gevent?
Otherwise I think this looks great! I'll ping you when I figure out the link issue.

yanirs commented Aug 17, 2014

@jdorweiler Thanks for testing this.
I updated the readme and also changed "Release date" to "Last release date" to make it less ambiguous. I'm looking forward to seeing it live :)

mwmiller commented Aug 17, 2014

also changed "Release date" to "Last release date"

@yanirs Could you make it "Latest" instead of "Last"? It sounds a bit less dire. 😁

yanirs commented Aug 17, 2014

@mwmiller sure, I actually had "latest" there before but then decided to go with "last" for some reason...

jdorweiler commented Aug 22, 2014

@yanirs I looked into the link problem and we actually don't have a way to add the links into an infobox. We have a specific way of doing it with the wikipedia fathead but no general way for other fatheads to use. There might be a general infobox in the future but for now do you want to try appending the links to the actual text? Maybe separate them with a <br> to put them on a new line.

yanirs commented Aug 23, 2014

@jdorweiler sure, I'll add the official link to the abstract. Should I also completely remove the external_links column (i.e., leave it empty) or keep it for when the infobox is added in the future?

jdorweiler commented Aug 23, 2014

@yanirs I'd just leave it empty for now.

yanirs commented Aug 24, 2014

@jdorweiler Done. Please let me know if there are any problems with this change.

jdorweiler commented Aug 26, 2014

@yanirs Cool thanks, I'm travelling right now but I'll get back to you later this week.

jdorweiler commented Sep 1, 2014

@yanirs Can you remove the links altogether for now? I just realized that the code we use internally to deploy fatheads will strip the links out anyway. I'll have to look into using the infobox later on. Otherwise I think this looks pretty good!
[Screenshot: selection_166]

yanirs commented Sep 1, 2014

@jdorweiler Do you mean I should remove the homepage link from the abstract? Are you sure it got stripped? The screenshot you attached seems to be an older version -- it should say "Latest release date" rather than "Release date".

jdorweiler commented Sep 2, 2014

@yanirs Sorry about that. Here's a screenshot of the new one. It does strip out the links though. The "more at" link at the bottom still works, but any links in the abstract will be stripped.
[Screenshot from 2014-09-01 20:04:35]

yanirs commented Sep 2, 2014

@jdorweiler OK, removed. Should be good to go now :)

jdorweiler commented Sep 8, 2014

@yanirs Thanks, I put it up here to test it out. Looks good to me but let me know if you run into anything weird. https://ddh4.duckduckgo.com/?q=python+numpy

jdorweiler commented Sep 15, 2014

@yanirs This looks great and thanks for the help 👍

I'm working on the infobox link issue and can ping you when I get it working if you're still interested in adding those.

jdorweiler pushed a commit that referenced this pull request Sep 15, 2014

jdorweiler
Merge pull request #95 from yanirs/pypi-data-extraction-fixes
Converted pypi to use the simple HTML and JSON APIs

@jdorweiler jdorweiler merged commit 8b21e9d into duckduckgo:master Sep 15, 2014

yanirs commented Sep 16, 2014

@jdorweiler Thanks! I'm happy to have a look at adding the infobox links once it's working.

jdorweiler commented Sep 26, 2014

@yanirs Just went live. Nice work! https://duckduckgo.com/?q=python+numpy
