Updated some entries in the file COUNTER_Robots_list and added 1636 n… #62

Open. CRMGB wants to merge 3 commits into base: master

@CRMGB CRMGB commented Jan 25, 2024

Updated the robots list with 1636 new robots.

Most of the new entries were gathered from https://github.com/monperrus/crawler-user-agents/blob/master/crawler-user-agents.json, which we used as a guide for detecting new bots in our user-agent dataset.

This list is designed to be used as a REGEX pattern to identify crawlers/bots from our user-agent entries and exclude them from our metrics if detected.
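
A minimal sketch of how such a list might be applied, assuming case-insensitive matching with Python's `re.search`. The two patterns and the sample user agents are illustrative only, and `is_robot` is a hypothetical helper, not something shipped in this repository.

```python
import re

# Illustrative subset only; the real list lives in COUNTER_Robots_list.json.
PATTERNS = [r"bot", r"^java\/\d{1,2}.\d"]

# Assumption: patterns are matched case-insensitively against the whole user-agent string.
COMPILED = [re.compile(p, re.IGNORECASE) for p in PATTERNS]


def is_robot(user_agent: str) -> bool:
    """Return True if any pattern matches somewhere in the user agent."""
    return any(rx.search(user_agent) for rx in COMPILED)


user_agents = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Java/1.6.0_22",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

# Keep only the entries that no pattern flags as a robot.
human_traffic = [ua for ua in user_agents if not is_robot(ua)]
print(human_traffic)
```

Only the Firefox entry survives the filter here; the other two are excluded from the metrics.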

The following files have been modified:

CHANGES.md
COUNTER_Robots_list.json
convert_to_txt
generated/COUNTER_Robots_list.txt

…ew entries, updated the CHANGES.md with the new bots and changes and corrections to the convert_to_txt file.
@alanorth
Contributor

@CRMGB there is a ton of duplication in your proposed additions, for example:

Java/1.4.1_01
Java/1.4.1_04
Java/1.5.0_16
Java/1.6.0_04
Java/1.6.0_07
Java/1.6.0_18
Java/1.6.0_22
...

Not to mention, our list already has this much better regular expression:

^java\/\d{1,2}.\d
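
As a quick check, here is a sketch that runs the existing expression against the versions listed above. It assumes the list is applied case-insensitively; without that, the lowercase `java` would not match `Java/`.

```python
import re

# Existing pattern quoted above. IGNORECASE is an assumption about how the list is applied.
java_pattern = re.compile(r"^java\/\d{1,2}.\d", re.IGNORECASE)

proposed = [
    "Java/1.4.1_01",
    "Java/1.4.1_04",
    "Java/1.5.0_16",
    "Java/1.6.0_04",
    "Java/1.6.0_07",
    "Java/1.6.0_18",
    "Java/1.6.0_22",
]

# Any entry left here would need its own pattern.
unmatched = [ua for ua in proposed if not java_pattern.search(ua)]
print(unmatched)  # []
```

Every listed version is already covered, so none of these entries adds anything to the list.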

Also, your proposed additions have not escaped special characters like forward slashes and dots:

AHC/2.0
ASPSeek/1.2.5
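
A sketch of why the escaping matters and one way to apply it. `escape_literal` is a hypothetical helper written for this example; it escapes `/` as well as the usual metacharacters, following the style of the existing `^java\/...` pattern.

```python
import re

# Regex metacharacters plus "/", which the existing list also writes as "\/".
SPECIAL_CHARS = set(r"\.^$*+?{}[]|()/")


def escape_literal(text: str) -> str:
    """Backslash-escape every special character in a literal user-agent fragment."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


for raw in ("AHC/2.0", "ASPSeek/1.2.5"):
    print(raw, "->", escape_literal(raw))
# AHC/2.0 -> AHC\/2\.0
# ASPSeek/1.2.5 -> ASPSeek\/1\.2\.5

# Unescaped, the dot is a wildcard and also matches unrelated strings.
loose = re.search("AHC/2.0", "AHC/2x0")
strict = re.search(escape_literal("AHC/2.0"), "AHC/2x0")
print(loose is not None, strict is not None)  # True False
```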

And your patterns are strangely cut off at a certain line length or something?

Cookie%20Stumbler/22
CrawlConvera0.1 (Cra
Custo x.x (www.netwu
CyberSpyder Link Tes
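
A sketch of a sanity check that would flag entries like these: it records each pattern's length and whether it compiles as a regular expression. The `compiles` helper is hypothetical, written only for this example.

```python
import re

suspects = [
    "Cookie%20Stumbler/22",
    "CrawlConvera0.1 (Cra",
    "Custo x.x (www.netwu",
    "CyberSpyder Link Tes",
]


def compiles(pattern: str) -> bool:
    """Return True if the pattern is a syntactically valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# A single shared length across unrelated entries hints at a fixed-width cut-off.
lengths = {len(p) for p in suspects}
# An unclosed "(" makes a truncated entry invalid as a regex.
invalid = [p for p in suspects if not compiles(p)]

print(lengths)  # {20}
print(invalid)  # ['CrawlConvera0.1 (Cra', 'Custo x.x (www.netwu']
```

All four entries are exactly 20 characters long, and the two with an opening parenthesis would not even load as regular expressions.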

Lastly, patterns like these would already be matched by the bot and spider patterns in our list:

purebot
Linguee Bot
voilabot
Baiduspider
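
A sketch of a redundancy check for proposed entries. It uses plain `bot` and `spider` as simplified stand-ins for the generic patterns in the list and assumes case-insensitive matching. Each candidate is treated as a literal string, so this is only a heuristic for literal entries; `AHC/2.0` is included as a contrasting case.

```python
import re

# Simplified stand-ins for the generic patterns already in the list (assumption).
existing = [re.compile(p, re.IGNORECASE) for p in ("bot", "spider")]

candidates = ["purebot", "Linguee Bot", "voilabot", "Baiduspider", "AHC/2.0"]

# A literal candidate that an existing pattern already matches adds nothing new:
# any user agent containing it would be caught by that existing pattern anyway.
redundant = [c for c in candidates if any(rx.search(c) for rx in existing)]
worth_reviewing = [c for c in candidates if c not in redundant]

print(redundant)        # ['purebot', 'Linguee Bot', 'voilabot', 'Baiduspider']
print(worth_reviewing)  # ['AHC/2.0']
```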

So this pull request is not in good shape as is. There may be some new bot patterns we can use from the other project, but in such a large list it is very difficult to verify them. Personally, I would prefer to have bots that have been verified from access log files directly.

@CRMGB
Author

CRMGB commented Jan 25, 2024

No problem @alanorth,
Thanks for the notes,
I will update accordingly,

Regards.

…t this regular expression: '^java\/\d{1,2}.\d'
…erns duplications. - Added the escape for special characters. - Added the right length for some patterns. - Deleted patterns with a previous match. Updated CHANGES.md and the txt version respectively.
@CRMGB CRMGB force-pushed the Update-bots-and-new-aditions branch from 9eedfa3 to b22e6c6 Compare January 29, 2024 18:13