
Add new bots #34

Closed
wants to merge 9 commits into from

Conversation

@alanorth (Contributor) commented Feb 25, 2020

This adds patterns to match the following robot user agents:

  • Citoid
  • ecointernet
  • Typhoeus
  • 7siters
  • sqlmap
  • Pattern
  • GigablastOpenSource
  • OgScrper
  • Turnitin

Relevant URLs for each user agent are included in the patterns file.

Typhoeus wraps libcurl in order to make fast and reliable requests.

See: https://github.com/typhoeus/typhoeus

The citoid Node.js service generates citation data given a URL, DOI,
ISBN, PMID, PMCID, or QID. It has a companion extension, Citoid, which
aims to make the citoid service available in VisualEditor.

See: https://www.mediawiki.org/wiki/Citoid

This is some kind of search engine or link aggregator. I have seen all
of the following user agents in my web server logs, totaling about
sixty thousand requests:

    ecolink (+https://search.ecointernet.org/)
    ecoweb (+https://search.ecointernet.org/)
    EcoInternet http://www.ecointernet.org/
    EcoInternet http://ecointernet.org/
    Biosphere EcoSearch http://search.ecointernet.org/

The user agents differ slightly, but the common "ecointernet" string
is always present.

See: https://ecointernet.org/
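
As a quick sanity check, assuming the patterns are applied as
case-insensitive regular expressions (an assumption on my part, not
something specified here), a plain ecointernet pattern covers every
variant above:

    import re

    # Sketch only: a bare "ecointernet" pattern, matched case-insensitively,
    # covers every user agent variant seen in the logs above.
    pattern = re.compile(r"ecointernet", re.IGNORECASE)

    agents = [
        "ecolink (+https://search.ecointernet.org/)",
        "ecoweb (+https://search.ecointernet.org/)",
        "EcoInternet http://www.ecointernet.org/",
        "EcoInternet http://ecointernet.org/",
        "Biosphere EcoSearch http://search.ecointernet.org/",
    ]

    for agent in agents:
        assert pattern.search(agent), agent
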
7siters is some kind of link and domain analysis database operating
its own spider.

See: https://7ooo.ru/siters/

sqlmap is an open source penetration testing tool that automates the
process of detecting and exploiting SQL injection flaws and taking
over database servers.

---

This is definitely not a human user agent; in fact, you would be well
advised to ban any IP address that makes requests declaring this user
agent!

See: https://github.com/sqlmapproject/sqlmap

Pattern is a web mining module for the Python programming language.

---

This seems to be an academic spider. The spider's user agent looks
like this in my logs:

    Pattern/2.6 +http://www.clips.ua.ac.be/pattern

Because the word "Pattern" is not very distinctive and could appear in
a legitimate human user's user agent, I suggest anchoring the string
to the beginning of the line and matching at least one digit for the
version.

See: https://www.clips.uantwerpen.be/pattern
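
For example, something like the following (a sketch in Python; the
exact pattern syntax depends on how the list is consumed) matches the
spider while leaving a browser user agent that merely contains the
word alone:

    import re

    # Anchored to the start of the string and requiring at least one
    # version digit, per the suggestion above. A sketch, not a final pattern.
    pattern = re.compile(r"^Pattern/[0-9]")

    assert pattern.search("Pattern/2.6 +http://www.clips.ua.ac.be/pattern")
    # A hypothetical human user agent that merely contains the word:
    assert not pattern.search("Mozilla/5.0 (X11; Linux x86_64) Pattern Lab")
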
@alanorth (Contributor, Author) commented Apr 29, 2020

I just noticed a few thousand more hits from one of these bots and remembered that this pull request was never merged. Here is a gentle reminder!

This bot is used by the gigablast.com search engine. I have seen it
with the following user agent:

    GigablastOpenSource/1

As the user agent is sufficiently unique, I don't think we need to
worry about handling the version number.

See: https://github.com/gigablast/open-source-search-engine

This user agent is responsible for tens of thousands of hits to my
web server over the last few years:

    OgScrper/1.0.0

There is very little information about this client on the web, but
it is included in several other robot detection libraries. As the
user agent is sufficiently unique, I don't think we need to worry
about matching the version number.
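
Since neither string is likely to appear in a human user agent,
unanchored literals should be enough; a minimal sketch, again assuming
plain regex matching:

    import re

    # Sketch: unanchored literal patterns match these clients regardless
    # of whatever version number they report.
    for pattern, agent in [
        ("GigablastOpenSource", "GigablastOpenSource/1"),
        ("OgScrper", "OgScrper/1.0.0"),
    ]:
        assert re.search(pattern, agent)
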
alanorth added a commit to ilri/DSpace that referenced this pull request Apr 30, 2020
Import some of the spider agent patterns that I had submitted to
the COUNTER-Robots project a few months ago that are still pending
merges:
  - atmire/COUNTER-Robots#33
  - atmire/COUNTER-Robots#34

Apparently the Turnitin.com plagiarism scanning service uses both
the TurnitinBot and Turnitin user agents. Right now COUNTER-Robots
does not block the second one.

See: https://turnitin.com/robot/crawlerinfo.html
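
A single unanchored Turnitin pattern would cover both variants, since
"Turnitin" is a prefix of "TurnitinBot"; a minimal sketch (the
TurnitinBot version string below is a hypothetical example, not from
my logs):

    import re

    # Sketch: "Turnitin" alone matches both user agents the service sends.
    assert re.search("Turnitin", "TurnitinBot/3.0")  # hypothetical version
    assert re.search("Turnitin", "Turnitin")
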
@alanorth (Contributor, Author) commented
Some of these have been merged, others not. I'll close this and re-submit the ones that haven't been merged.

@alanorth alanorth closed this Jul 20, 2020
@alanorth alanorth deleted the new-bots branch July 20, 2020 11:14
@alanorth alanorth mentioned this pull request Jul 20, 2020