Contents on EPrints repository is not featuring on Google Scholar #450

Closed
jiadiyao opened this issue Jul 18, 2017 · 6 comments

@jiadiyao
Contributor

@jiadiyao jiadiyao commented Jul 18, 2017

The problem

A number of repository administrators have noticed that their content is not featuring in Google Scholar search results.
We have been in discussion with Google Scholar in regard to how it discovers and indexes the contents of EPrints repositories.

While EPrints is designed to present its content to Google in the best possible way, Google Scholar is having trouble with the initial discovery of that content.
Google’s crawler processes hundreds of billions of links, and it needs a clearer way to identify that a link points to an EPrints repository rather than an ordinary website.
This would then allow Google Scholar to prioritise crawling and indexing. Google Scholar already has EPrints-specific rules in its crawler, and they are happy to update them.

@jiadiyao
Contributor Author

@jiadiyao jiadiyao commented Jul 18, 2017

The solution

Google Scholar and I have come up with a plan to increase the discoverability of EPrints content.

Currently, records on EPrints have URLs which look like
http://YOUR-REPO/EPRINTID/ (e.g. http://irep.ntu.ac.uk/12853/)
However, this is not easily identified as EPrints content without visiting the actual page, and Google has a lot of pages to visit.

We intend to promote the existing EPrints “URI” form of the links, which is easily identified as EPrints content:
http://YOUR-REPO/id/eprint/EPRINTID/ (e.g. http://irep.ntu.ac.uk/id/eprint/12853/)
Currently the longer form of the URL redirects to the shorter version. We would like to swap that around so that the shorter form redirects to the longer one.
That way no existing links will stop working, but references to your repository and, more importantly, Google's indexer will gradually move to the longer, identifiable version.

Document URLs would need to be changed in a similar way. Again, any existing links would continue to work, but the promoted version of the links would change from
http://irep.ntu.ac.uk/12853/1/185527_3220%20Heasell%20prepublilsher.pdf
to
http://irep.ntu.ac.uk/id/eprint/12853/1/185527_3220%20Heasell%20prepublilsher.pdf
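As an illustration of the mapping described above, here is a minimal Python sketch that converts a short record or document URL into its /id/eprint/ form. It is not the EPrints implementation (EPrints itself is written in Perl and performs the redirect on the server); the function name `to_long_url` is hypothetical, and the sketch assumes the repository is served from the root of its host and that record paths start with a numeric eprint ID.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Short form: /<eprintid>/ optionally followed by /<position>/<filename>.
# Assumption: the repository lives at the root of the host.
_SHORT_PATH = re.compile(r"^/(\d+)(/.*)?$")


def to_long_url(url: str) -> str:
    """Return the /id/eprint/ form of a short EPrints URL.

    URLs that do not match the short pattern, including URLs that are
    already in the long form, are returned unchanged.
    """
    parts = urlsplit(url)
    match = _SHORT_PATH.match(parts.path)
    if match is None:
        return url
    eprint_id = match.group(1)
    # A bare "/12853" is normalised to end with a trailing slash.
    rest = match.group(2) or "/"
    return urlunsplit(parts._replace(path=f"/id/eprint/{eprint_id}{rest}"))


if __name__ == "__main__":
    print(to_long_url("http://irep.ntu.ac.uk/12853/"))
```

The same function covers document URLs, because everything after the eprint ID is carried over untouched.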

Justin

@jiadiyao
Contributor Author

@jiadiyao jiadiyao commented Jul 18, 2017

We have made the changes described above locally and they have proved successful.
We have now also applied the changes to the EPrints core.
These changes are disabled by default, but can be enabled by updating your 20_base_urls.pl to include
$c->{use_long_url_format} = 1;

If you apply these changes and would like Google Scholar to prioritise a reindex of your repository, get in touch with us and we’ll pass the message along to them.

Justin

@jesusbagpuss
Contributor

@jesusbagpuss jesusbagpuss commented Jul 19, 2017

Hi Jiadi,
I can see the logic in this, but I think there are other aspects that need a bit more consideration.
For an item that is not in the live archive, the URL /id/eprint/XXX will take you to the login page, or the item control page.
For a retired item, the URL /XXX will take you to the tombstone page (which will link to other versions if they exist).
For other items, it will take you to a 404 page.

The login page returns a 401 - which is NOT correct in this instance.

I think this change needs a bit more thought about how non-live items are reflected to Google - and 'normal' users.
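One way to see what a crawler receives for each URL form is to request the URL without following redirects and look at the status code. The Python sketch below is a diagnostic aid only, not part of EPrints; the names `probe` and `describe` are hypothetical, and it assumes the server answers HEAD requests the same way it answers GET, which is not guaranteed.

```python
import http.client
from typing import Optional, Tuple
from urllib.parse import urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def describe(status: int, location: Optional[str] = None) -> str:
    """Turn a status code (and Location header, if any) into a short label."""
    if status in REDIRECT_CODES:
        return f"redirect to {location}"
    if status == 200:
        return "page served"
    if status == 401:
        return "login required"
    if status == 404:
        return "not found"
    return f"other ({status})"


def probe(url: str) -> Tuple[int, Optional[str]]:
    """Send one HEAD request; http.client never follows redirects itself."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        conn.request("HEAD", parts.path or "/")
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()
```

Calling `describe(*probe(url))` for both the /XXX and /id/eprint/XXX forms of a live, a retired, and a non-live item shows which combinations end in a login prompt, a tombstone page, or a 404.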

@jiadiyao
Contributor Author

@jiadiyao jiadiyao commented Jul 21, 2017

Hi John,
Thanks for your feedback!
The changes we made did not alter the return code or any redirection already in place in EPrints.
The odd return code for the login page was there before. The login CGI script is a secured page, which prompts the user to log in first. Once the user is logged in, it simply redirects to whatever the "target" parameter says.

@prazetyo

@prazetyo prazetyo commented Sep 2, 2018

Hi Jiadi,
If I upgrade from 3.3.12 to 3.4 and enable use_long_url_format, will there be any problem with the URLs that Google Scholar indexed earlier under version 3.3.12?

Agung Prasetyo W.

@DavidGrau

@DavidGrau DavidGrau commented Aug 30, 2021

> We have made the changes described above locally and they have proved successful.
> Now we have now also applied the changes to the EPrints core.
> These changes are disabled by default, but can be enabled by updating your 20_base_urls.pl to include
> $c->{use_long_url_format} = 1;
>
> If you apply these changes and would like Google Scholar to prioritise a reindex of your repository, get in touch with us and we’ll pass the message along to them.
>
> Justin

Hi Jiadi,
I have already applied the changes by setting $c->{use_long_url_format} = 1; I now only need to edit my sitemap generator so that it also outputs the long URLs.
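For readers who maintain their own sitemap generator, the adjustment might look roughly like the Python sketch below, which emits long-form record URLs in the standard sitemap XML format. It is not the EPrints sitemap code; `build_sitemap` is a hypothetical name, and the list of eprint IDs is assumed to come from wherever the existing generator already gets it.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url: str, eprint_ids) -> str:
    """Return sitemap XML whose <loc> entries use the /id/eprint/ form."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    root = base_url.rstrip("/")
    for eprint_id in eprint_ids:
        url = ET.SubElement(urlset, "url")
        loc = ET.SubElement(url, "loc")
        # int() rejects anything that is not a numeric eprint ID.
        loc.text = f"{root}/id/eprint/{int(eprint_id)}/"
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap("http://irep.ntu.ac.uk", [12853]))
```

Listing only the long form in the sitemap keeps it consistent with the URLs the repository now redirects to.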

My question is: how can we speed up indexing in Google Scholar?
