
Initial scholr.ly integration. #44

Closed
wants to merge 4 commits into from

3 participants

@mhluongo

I wanted to make sure everything worked before we hook up to production data.

@rpicard

Hey there! Sorry it took a couple of days to get to you. I thought GitHub was set up to email me when new pull requests are submitted, but it looks like I was wrong!

Now, in response to your code:

  • What is the overall purpose of this plugin? Just so I can get a better idea of who will be using it, and what it helps them do.
  • It looks like this uses "test data." Is there any reason we aren't using the real data yet?
  • You're using some custom CSS in the abstract template. It's generally best to stick with plain text in the abstract. Some light markup (e.g. italics) is warranted in some cases, but it's generally unnecessary.
@nospampleasemam

Hi @mhluongo,

I'm Dylan, and we've been talking a bit over email. I was waiting to respond until I had a few more things figured out, but I'm glad @rpicard commented, because it brought us to a discussion where he was able to solve one of the issues I was having (on our end). I've reviewed the code, and everything looks good. I'm working on setting up a development machine where we'll be able to test the deployment and review the result.

Please allow us another few days to make sure everything is right, and I will update here.

Thanks again :-)

@mhluongo

@rpicard @nospampleasemam thanks guys, looking forward to it.

@mhluongo

Is everything square on you guys' end?

@rpicard

Hey @mhluongo. I'm really sorry about the delay! We absolutely should not have kept you waiting like this. Dylan (@nospampleasemam) and I are making it a priority to work out the problems we ran into in getting your plugin ready as soon as we can.

I'll keep you updated!

@rpicard

Okay, we've been reviewing the plugin and we have a couple of fixes we'd like you to make:

  • Remove the HTML / CSS from the abstract.
    We prefer to keep things as text. Sometimes HTML can make its way in there if it's needed, but I don't think that's the case here. This includes the links too, since they can actually mess with the ZCI styles (it becomes a link within a link).

  • Reformat the abstract
    We were thinking that this would be a good format:

Bernd Girod is a researcher interested in model-based video encoding, mpeg-4, video coding and multiframe prediction. Girod has 326 papers with 0 coauthors and 1970 citations.

The things to note here are:

  • We rearranged the last sentence
  • We used his last name as the subject of the last sentence
  • If any number is 1, the word (e.g. coauthors, or papers) should switch to the singular (e.g. coauthor or paper)
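The singular/plural rule can be sketched with a small helper (a hypothetical name, not part of the plugin code, which uses a suffix-based approach):

```python
def count_phrase(n, noun):
    """Format '<n> <noun>', switching to the singular only when n == 1,
    e.g. '1 coauthor' but '0 coauthors' and '326 papers'."""
    return '%d %s%s' % (n, noun, '' if n == 1 else 's')

# Assembling the second sentence of the example abstract:
sentence = 'Girod has %s with %s and %s.' % (
    count_phrase(326, 'paper'),
    count_phrase(0, 'coauthor'),
    count_phrase(1970, 'citation'))
# -> 'Girod has 326 papers with 0 coauthors and 1970 citations.'
```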

We both really like the plugin, and we're excited about getting it live. When do you want to make a full dataset available?

@mhluongo

Hm, I'm in Austin for a conference this weekend- I should be able to take care of those changes and make a full dataset available early next week. Exciting stuff :)

@rpicard

Sounds good. Enjoy the conference!

@mhluongo

I just pushed the requested changes. The only thing I'm not certain about is removing the links from keywords- I understand why you guys need that done, but I wish there were an easy way to include related topics with each author (say by linking to a DDG search, or a Wiki instant answer). Anyway, that might be an interesting thing to work on down the road.

I'll have a proper data URL for you guys by tomorrow at the latest. I want to make sure we expose only the most important / complete author profiles for now to keep things relevant, and then expand as we're more confident in our data.

@rpicard

@mhluongo Thanks for updating the data. It looks good now. I'll have some others take a look at it while you get that data ready.

@mhluongo

@rpicard I fixed some things to handle names with non-ASCII characters, and added an initial data URL for ~72k authors that I think will be a good start. I figure that every once in a while the URL can be updated, as we get more data and better disambiguate what we have.

The output.txt is now UTF-8 encoded- hopefully that won't be a problem. LMK if there's anything else I can do.
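Encoding-wise, what the parser does amounts to building each row as unicode and encoding it to UTF-8 on write. A minimal, version-agnostic sketch (the row content here is hypothetical):

```python
import io

# Hypothetical output row with non-ASCII characters in the name.
row = u'Fran\u00e7ois Dupont\tA\tresearcher profile'

# io.open behaves the same on Python 2 and 3 and handles the encoding,
# so the file on disk is UTF-8 regardless of interpreter version.
with io.open('output.txt', 'w', encoding='utf-8') as f:
    f.write(row + u'\n')
```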

@rpicard

Right now the category doesn't work because there is a limit to how many items we put in a single category. Since all 140,000+ are in the "researchers" category, we'd need to separate them or forgo the category altogether. If you're interested, I think we could make categories for their areas of interest (e.g. 'Social dependency researchers' for https://robert.duckduckgo.com/?q=Ban+Al-ani).

@rpicard

Ten thousand per category is going to be too much. I think it makes sense to skip the category here.

That typo was on our end; I just fixed it. Right now the plugin is going through our brief QA process, but I think I've addressed all of the main issues so, barring some big revelation, it should be ready to launch in the next day or so. :)

@rpicard

@mhluongo Hey, just wanted to give you a quick update. We're still working out some kinks on our end with triggers. Unfortunately the favicon won't be there on the main site either because it doesn't appear to be working on the service we use (https://getfavicon.appspot.com/). I've blacklisted it so a blank image doesn't appear in its place.

@mhluongo

I've stopped using a redirect for the favicon and made sure it's served directly by our webserver, instead of S3- hopefully that will help. Let me know if you guys want to spitball about triggers- I'd be happy to do what I can to help.

@rpicard

@mhluongo I'll keep an eye on the service to see if it updates with that change. The remaining trigger problems aren't about which triggers we're using; it's just a technical issue where they don't work when used after the search term, e.g. alex pentland scholr.ly instead of scholr.ly alex pentland, where scholr.ly is the trigger. I'm thinking I'll go ahead and deploy anyway and then look into that problem separately so we don't delay getting this out there any longer!
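The start-vs-end distinction can be illustrated with a simplified check (an illustration only, not DuckDuckGo's actual trigger machinery):

```python
def match_trigger(query, trigger='scholr.ly'):
    """Return the remaining search term if the trigger word appears at
    the start or end of the query, else None."""
    words = query.split()
    if not words:
        return None
    if words[0] == trigger:
        return ' '.join(words[1:])    # trigger before the search term
    if words[-1] == trigger:
        return ' '.join(words[:-1])   # trigger after the search term
    return None
```

Both query forms then yield the same search term; the bug was that only the trigger-first form worked.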

@rpicard

@mhluongo The plugin has been deployed! :rocket:

I was able to fix the triggering issues and it's all live.

I'll merge this pull request into the repo and close it out. Let me know if you have any questions!

@mhluongo

Excellent! Thanks :)

@mhluongo

Obviously this isn't a blocker, but is there anything else I can do to get that favicon working? I've moved it to http://scholr.ly/favicon.ico (no redirects like before) and have a link to it from every page on the site (<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico">). We haven't had problems with any modern browsers, but I still can't seem to get it to show up on the service you guys use.

@rpicard

@mhluongo I am really not sure why it isn't working. I've tried decaching it on their servers (they have a decache URL) but it hasn't helped. We are looking at implementing our own favicon retriever down the line, but right now it isn't a priority, so we'll have to see. I've submitted an issue on their issues tracker: potatolondon/getfavicon#16.

@rpicard rpicard closed this
Commits on Feb 18, 2013
  1. @mhluongo
Commits on Apr 3, 2013
  1. @mhluongo
Commits on Apr 5, 2013
  1. @mhluongo
  2. @mhluongo

    Added a new download URL.

    mhluongo authored
29 lib/DDG/Fathead/Scholrly.pm
@@ -0,0 +1,29 @@
+package DDG::Fathead::Scholrly;
+
+use DDG::Fathead;
+
+primary_example_queries "scholrly alex pentland";
+
+secondary_example_queries
+ "charles isbell",
+ "sch greg abowd";
+
+description "Scholrly researcher profiles";
+
+name "Scholrly";
+
+icon_url "";
+
+source "Scholrly";
+
+code_url "https://github.com/duckduckgo/zeroclickinfo-fathead/tree/master/share/scholrly";
+
+topics "computing", "math", "science", "programming";
+
+category "reference";
+
+attribution
+ github => ['https://github.com/scholrly', 'scholrly'],
+ twitter => ['https://twitter.com/scholrly', 'scholrly'];
+
+1;
2  share/scholrly/README.txt
@@ -0,0 +1,2 @@
+Dependencies:
+ Python 2.6+
4 share/scholrly/fetch.sh
@@ -0,0 +1,4 @@
+#!/bin/bash
+mkdir -p download
+# XXX - starting with test data
+curl "https://s3.amazonaws.com/scholrly-external/ddg-2013-4-3.tsv" --output "download/download.tsv"
5 share/scholrly/meta.txt
@@ -0,0 +1,5 @@
+Name: Scholrly
+Domain: scholr.ly
+Type: Research
+MediaWiki: 0
+Keywords: scholrly, schol, sch,research,researcher
129 share/scholrly/parse.py
@@ -0,0 +1,129 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+from collections import namedtuple
+import json, itertools, urllib, re, sys
+
+ABSTRACT_TEMPLATE = unicode("""
+{name} is a researcher{keyword_phrase}. {last_name} has written {num_papers} paper{paper_prefix} with {num_coauthors} coauthor{coauthor_prefix} and {num_citations} citation{citation_prefix}.
+""")
+
+AUTHOR_CATEGORIES = ['researchers']
+
+DownloadRow = namedtuple('DownloadRow',
+ ['names','url','image_url','num_coauthors', 'num_papers',
+ 'num_citations','keywords'])
+
+class ParsedDownloadRow(DownloadRow):
+ @property
+ def names(self):
+ try:
+ return json.loads(super(ParsedDownloadRow, self).names)
+ except:
+ return []
+
+ @property
+ def keywords(self):
+ try:
+ return json.loads(super(ParsedDownloadRow, self).keywords)
+ except:
+ return []
+
+DDGOutputRow = namedtuple('DDGOutputRow',
+ ['title', 'type', 'redirect', 'other_uses', 'categories', 'references',
+ 'see_also', 'further_reading', 'external_links', 'disambiguation',
+ 'images', 'abstract', 'source_url'])
+
+def replace_whitespace(s):
+ return unicode(s).replace('\t',' ').replace('\n', ' ').replace('\r', ' ')
+
+WHITESPACE_PATTERN = re.compile(r'\s+')
+
+def minify_whitespace(s):
+ return WHITESPACE_PATTERN.sub(' ', s)
+
+def ddg_search_url(query):
+ return 'https://duckduckgo.com/?%s' % urllib.urlencode({'q':query})
+
+def format_keywords(keywords):
+ linked_kw = [kw.lower() for kw in keywords]
+ first_part = ', '.join(linked_kw[:-2])
+ second_part = ' and '.join(linked_kw[-2:])
+ parts = [part for part in [first_part, second_part] if len(part) > 0]
+ return ', '.join(parts)
+
+def output_from_row(row):
+ # generate the main page
+ if len(row.names) == 0 or len(row.keywords) == 0:
+ return ''
+
+ # NB these templating funcs expect n >= 0
+ def number_or_no(n):
+ return unicode(n) if n > 0 else 'no'
+
+ def plural_suffix(n):
+ return 's' if n > 1 or n == 0 else ''
+
+ keyword_phrase = ' interested in %s' % format_keywords(row.keywords) \
+ if len(row.keywords) > 0 else ''
+
+ # NB this is not the best way to handle last names (at all), but should
+ # work for the majority of cases right now
+ last_name = row.names[0].split()[-1]
+
+ num_coauthors = number_or_no(row.num_coauthors)
+ coauthor_prefix = plural_suffix(row.num_coauthors)
+
+ num_papers = number_or_no(row.num_papers)
+ paper_prefix = plural_suffix(row.num_papers)
+
+ num_citations = number_or_no(row.num_citations)
+ citation_prefix = plural_suffix(row.num_citations)
+
+ article = DDGOutputRow(title=row.names[0],
+ type='A',
+ redirect='',
+ other_uses='',
+ categories='\\n'.join(AUTHOR_CATEGORIES),
+ references='',
+ see_also='',
+ further_reading='',
+ external_links='[%s More at Scholrly]' % row.url,
+ disambiguation='',
+ images='[[Image:%s]]' % row.image_url,
+ abstract=minify_whitespace(
+ ABSTRACT_TEMPLATE.format(
+ name=row.names[0],
+ last_name=last_name,
+ num_coauthors=num_coauthors,
+ coauthor_prefix=coauthor_prefix,
+ num_papers=num_papers,
+ paper_prefix=paper_prefix,
+ num_citations=num_citations,
+ citation_prefix=citation_prefix,
+ keyword_phrase=keyword_phrase)),
+ source_url=row.url)
+ # generate redirects for any aliases
+ redirects = [DDGOutputRow(title=name, type='R',redirect=row.names[0],
+ other_uses='',categories='',references='',
+ see_also='',further_reading='',external_links='',
+ disambiguation='', images='', abstract='',
+ source_url='')
+ for name in row.names[1:]]
+ return '\n'.join('\t'.join(replace_whitespace(el) for el in row)
+ for row in [article] + redirects)
+
+used_names = set()
+
+if __name__ == '__main__':
+ with open(sys.argv[1]) as data_file:
+ # read in the downloaded data, skipping the header
+ rows = (ParsedDownloadRow(*line.split('\t'))
+ for line in itertools.islice(data_file, 1, None))
+ with open(sys.argv[2], 'a') as output_file:
+ for row in rows:
+ # make sure we don't use a name twice, since we don't do disambig
+ # pages yet
+ if all(name not in used_names and not used_names.add(name)
+ for name in row.names):
+ output_file.write(output_from_row(row).encode('utf8') + '\n')
2  share/scholrly/parse.sh
@@ -0,0 +1,2 @@
+#!/bin/bash
+python parse.py download/download.tsv output.txt
4 share/scholrly/queries.txt
@@ -0,0 +1,4 @@
+bernd girod
+aaron bobick
+sandy pentland
+researcher a. bobick