We should not produce output in a certain case #16

Closed
funilrys opened this issue Jan 2, 2019 · 29 comments

@funilrys
Owner

funilrys commented Jan 2, 2019

We should not write or produce output if an element which is in the database is still INACTIVE or INVALID on retest.

@dnmTX said (anudeepND/blacklist#27 (comment)):

@funilrys I got your point, but it makes me wonder what good they are doing in a folder that is designed to collect invalid domains that came from the original lists during filtering. In our case they're no longer present there (in the original lists). Maybe a subfolder for collecting "old, no longer present invalid domains"? They can pile up there and keep the main folder tight, with only the fresh ones.

funilrys self-assigned this Jan 2, 2019
funilrys changed the title from "We should not produce output in certain case" to "We should not produce output in a certain case" Jan 2, 2019
@funilrys
Owner Author

funilrys commented Jan 2, 2019

On the other hand, if it becomes ACTIVE, we should include it in the official ACTIVE list and maybe at the same time in a new analytic section/directory.

@dnmTX

dnmTX commented Jan 2, 2019

In 99.5% of the cases (based on my observations for that particular set of lists, anudeepND's) they are indeed INVALID.
For example:
2 ml.pubnative.net # there is a "2" and a space in front of the domain
url p.adsymptotic.com # there is a "url" (weird, I know) and a space in front
Is there any chance that those two, and the rest which are similar cases, will EVER become valid?
Your script is doing a great job finding them; the rest is up to you: either dispose of them or leave them in rotation, which adds overhead to the whole filtering process. Anyway, it takes two to three days to filter one list, so how about we start thinking about how to reduce that time?
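
To illustrate why entries like the two above fail a syntax check, here is a minimal sketch. The regular expression and the `looks_like_domain` helper are assumptions made for illustration only; they are not PyFunceble's actual validation logic.

```python
import re

# Hypothetical, simplified hostname pattern: dot-separated labels made of
# letters, digits and hyphens, followed by an alphabetic TLD.
# This is NOT PyFunceble's real syntax check.
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
DOMAIN_RE = re.compile(rf"^(?:{LABEL}\.)+[A-Za-z]{{2,63}}$")


def looks_like_domain(line: str) -> bool:
    """Return True if the whole line is a syntactically plausible domain."""
    return DOMAIN_RE.fullmatch(line) is not None


for entry in ("ml.pubnative.net", "2 ml.pubnative.net", "url p.adsymptotic.com"):
    print(f"{entry!r}: {looks_like_domain(entry)}")
```

Because a space can never be part of a hostname, lines with a stray prefix such as `2 ` or `url ` cannot become valid unless the list itself is corrected.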

@anudeepND

anudeepND commented Jan 3, 2019

@dnmTX @funilrys All these invalid domains came from a subdomain scanner I used months ago. I didn't take a close look at the output, which is how those domains ended up on the final list. Sorry for that :)
I will take a closer look, line by line if possible, to remove them.

Edit: The domains mentioned by @dnmTX were removed on December 23rd (anudeepND/blacklist@1a8f70e) along with other misspelled domains.

@dnmTX

dnmTX commented Jan 3, 2019

@anudeepND I know. You mentioned it in the issue I opened in your repo. No apologies needed, as I'm only trying to make you aware of the situation so you can do some cleaning. I'm not affected, as I'm loading your lists from the Ultimate-Hosts-Blacklist, specifically the clean.list, which is already filtered of everything possible (dead, invalid, whitelisted and so on).
I would advise you to go by what is in the INVALID folder, see if you missed something, and then check only the commits to that particular folder every week or so to see if anything new shows up, until @funilrys sorts this one out and leaves only the newcomers in the folder.

@anudeepND

@dnmTX I have removed most of the domains present in the INVALID section from my lists, but they have not been removed from the INVALID section itself. (For example, I have removed domains ending with the invalid TLD .col.) How often does the INVALID section get updated?

@dnmTX

dnmTX commented Jan 3, 2019

@anudeepND It doesn't get updated at present, i.e. old (non-existent) entries are not removed so that only new ones remain. That's why I told you, for now at least, to check the commits to that folder and see if anything new is added. This is what this issue is about: @funilrys and I are debating what to do with all those domains that have already been removed from your original lists but still remain in the INVALID folder. So stand by, as @funilrys has the last word here...

@funilrys
Owner Author

funilrys commented Jan 4, 2019

Fix introduced with 61b3bdd.

It is now on the dev version.

@funilrys
Owner Author

funilrys commented Jan 4, 2019

Let me explain what changed 13 minutes ago for everyone using the dev version, and what will change later this week for everyone using the stable version.

Problem

We were systematically generating outputs when retesting the content of the database subsystem. This caused some lists to contain INVALID and INACTIVE elements which are no longer on the list.

Solution

I disabled the production of outputs (to file, not on screen) for every element which is still INACTIVE or INVALID and at the same time already registered in the database.

Concretely, that means that for now, if the system retests an element which is in the database, you'll get a friendly line on screen like in a normal test, but if the tested element is still INACTIVE or INVALID you'll not get anything in the generated data.
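
The rule described above can be sketched roughly as follows. This is a minimal illustration, not PyFunceble's actual code: the function name `should_write_to_file` and its parameters are hypothetical.

```python
# Statuses for which file output is skipped on a database retest.
SUPPRESSED_STATUSES = {"INACTIVE", "INVALID"}


def should_write_to_file(status: str, in_database: bool) -> bool:
    """Decide whether a tested element should appear in the generated files.

    Screen output is not modelled here; it is always produced.
    Hypothetical helper, for illustration only.
    """
    if in_database and status in SUPPRESSED_STATUSES:
        # Retested element that is still INACTIVE/INVALID: no file output.
        return False
    return True


for status, in_db in [("INVALID", True), ("INACTIVE", True),
                      ("ACTIVE", True), ("INVALID", False)]:
    print(status, in_db, should_write_to_file(status, in_db))
```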

Side note

Please be aware: If the algorithm/system/script changes because of something like #17 (sorry 😭), a new web practice or a new RFC, some of those domains may become ACTIVE in the future.

If that is the case, we put the newly ACTIVE domain in the official {domains, json, hosts}/ACTIVE/* lists and at the same time write it into the Analytic/SUSPICIOUS/* files so that you can keep track of what changes.
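
As a rough sketch of that routing: the directory names below come from the paragraph above, while the function `active_output_dirs`, its parameters, and the assumption that "already in the database" means "previously INACTIVE or INVALID" are hypothetical and only meant to illustrate the idea.

```python
from typing import List

# Directory names taken from the comment above; file names are not modelled.
ACTIVE_DIRS = ["domains/ACTIVE", "json/ACTIVE", "hosts/ACTIVE"]
SUSPICIOUS_DIR = "Analytic/SUSPICIOUS"


def active_output_dirs(status: str, in_database: bool) -> List[str]:
    """Return the directories an ACTIVE element would be written to.

    Only the ACTIVE case is modelled; any other status returns an empty
    list in this sketch. Hypothetical helper, not PyFunceble's API.
    """
    if status != "ACTIVE":
        return []
    dirs = list(ACTIVE_DIRS)
    if in_database:
        # Assumption: the element was INACTIVE/INVALID before, so it is
        # additionally tracked as SUSPICIOUS.
        dirs.append(SUSPICIOUS_DIR)
    return dirs


print(active_output_dirs("ACTIVE", in_database=True))
print(active_output_dirs("ACTIVE", in_database=False))
```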

If you use outputs from @dead-hosts or @Ultimate-Hosts-Blacklist, it is not a problem, as they generate a clean.list which only contains the elements which are ACTIVE. But if you use PyFunceble as a "standalone" sub-system/script/module, you should keep in mind that such changes can happen.

If there are any questions, please let me know.

Cheers,
Nissar

@dnmTX

dnmTX commented Jan 4, 2019

Looks GOOD and...no...no questions 😉

@anudeepND

@funilrys I looked at the latest commit, and the output doesn't contain an INVALID list. Does that mean everything's good?

funilrys added a commit that referenced this issue Jan 5, 2019
It's always great to get feedback :)

Fix:

- of the test/check of DNS names which ends with `.` (#17).
- of the `--filter` argument which was not working if a special
character was given.
- of the way we clean and construct the list we have to test. (#18)

Review:

- of the way we check syntaxes.
- of the way we produce outputs for elements which are already
registered into the database. (#16)
- of the way we remove an element from the database.
- of the way we merge upstream with the local configuration file.

Contributors:

- @dnmTX
- @jawz101
- @speedmann

@funilrys
Owner Author

funilrys commented Jan 5, 2019

@anudeepND yes, it's the expected behavior 👍

@dnmTX

dnmTX commented Jan 5, 2019

I would advise keeping this one OPEN and monitoring it for a couple of weeks.

@anudeepND the filtering just started, so let's wait until it's done before drawing any conclusions.
How to check? Go to info.json; there is an entry there:
"currently_under_test": 1 = still filtering, 0 = done with the filtering
It usually takes 2 to 3 days.
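
A small sketch of that check, assuming info.json is a JSON object and that `currently_under_test` holds 1 or 0 (possibly stored as a string). The helper name `is_still_filtering` is hypothetical.

```python
import json


def is_still_filtering(info: dict) -> bool:
    """Return True while the list is still being tested.

    Assumption: a missing key is treated as "not under test".
    """
    return int(info.get("currently_under_test", 0)) == 1


# Inline sample standing in for the content of a repository's info.json.
sample = json.loads('{"currently_under_test": 1}')
print(is_still_filtering(sample))
```

With a local copy of the file, the dictionary could be loaded with `json.load(open("info.json"))` before calling the helper.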

@funilrys
Owner Author

funilrys commented Jan 5, 2019

@dnmTX I will keep this closed unless it turns out not to be the case 😸
I have been monitoring it since I pushed it to the dev version 😉
For now, I can tell that it is working!

@dnmTX

dnmTX commented Jan 6, 2019

OK... Since @anudeepND has already removed all (I assume) of the invalid ones, there is no way to tell from those lists whether the new changes are working, so I was monitoring a different list: justdomains_.... Its domains.list hasn't been updated for 29 days so far, and in that list's INVALID folder there were two domains sitting for a long time (they are also in the domains.list; sorry, but I don't remember the exact ones). Now the filtering has just finished and the INVALID folder is empty, which means that something is not right.
Those two domains should've shown up after each filtering because they were never removed from the domains.list, i.e. the original list.

@dnmTX

dnmTX commented Jan 6, 2019

Here you go, I found them. Those two should've shown up in the INVALID folder after the last filtering:
[image: autosave]

On the upside, the filtering cycle is much, much faster, the invalid domains are indeed filtered, and by the look of it ACTIVE/hosts doesn't get filled with duplicates anymore.
The downside is that if there are any invalid domains in the future, they will not be placed in the INVALID folder until the next filtering.

funilrys reopened this Jan 7, 2019
@dnmTX

dnmTX commented Jan 7, 2019

@funilrys I'm not sure you're aware, but just to point it out:
This one has no domains in it 😶
This one has only two(2) 🤔

@funilrys
Owner Author

funilrys commented Jan 8, 2019

Hi @dnmTX, I'm aware of that 😸

@Ultimate-Hosts-Blacklist is open to everybody who wants to have their own repository and at the same time be included in https://github.com/mitchellkrogza/Ultimate.Hosts.Blacklist 😄
Please create an issue or start an internal discussion at @Ultimate-Hosts-Blacklist, as @smed79 is responsible for the mentioned repositories! We only provide the infrastructure!

Cheers,
Nissar

@dnmTX

dnmTX commented Jan 8, 2019

Please create an issue or start an internal discussion at @Ultimate-Hosts-Blacklist, as @smed79 is responsible for the mentioned repositories! We only provide the infrastructure!

I'll wait a while to see if @smed79 responds to the discussion here. If he doesn't, I will.

@smed79
Contributor

smed79 commented Jan 8, 2019

Hi,
propellerads revolving adservers added ==> domains.list
As for the "assorted" repo, I don't remember being the one who created it.

Thank you for notifying.

@smed79
Contributor

smed79 commented Jan 8, 2019

As for the "assorted" repo, I don't remember being the one who created it.

I searched my mailbox and found the conversations below about cliqz.com
https://git.io/fhZUB
Ultimate-Hosts-Blacklist/Ultimate.Hosts.Blacklist@fbad015

It's unclear to me what this repo will include, so I am not going to maintain it.

@funilrys I sent you a request via email about creating a repo with the purpose of blocking "popads revolving ad servers".

Thank you.

@funilrys
Owner Author

funilrys commented Jan 8, 2019

Hi @smed79, the assorted repo was created when we migrated to the current infrastructure. I will leave it as it is until I have talked with Mitchell.

Will do asap @smed79 👍

@funilrys
Owner Author

funilrys commented Jan 8, 2019

Fix for @dnmTX's last report (#16 (comment)) introduced with 784ad72.

It is now in the dev version.

P.S.: As @Ultimate-Hosts-Blacklist now uses the master/stable version, it will take effect there once this issue is automatically closed again.
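
One way to read this follow-up fix, based on the report that domains still present in domains.list stopped showing up: file output should only be suppressed when the element comes from the database and is no longer in the list under test. The sketch below models that reading; `keeps_file_output` and its parameters are hypothetical, and the real change in 784ad72 may differ.

```python
def keeps_file_output(status: str, in_database: bool, in_tested_list: bool) -> bool:
    """Refined rule: decide whether a tested element is written to files.

    Hypothetical helper, for illustration only.
    """
    if status == "ACTIVE":
        # ACTIVE elements are always written.
        return True
    if in_database and not in_tested_list:
        # Stale INACTIVE/INVALID entry that was removed upstream: skip it.
        return False
    # Still in the list (or tested for the first time): keep reporting it.
    return True


# A domain that is INVALID, known to the database, but still in domains.list
# is written again, which matches what the report above expected.
print(keeps_file_output("INVALID", in_database=True, in_tested_list=True))
print(keeps_file_output("INVALID", in_database=True, in_tested_list=False))
```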

funilrys added a commit that referenced this issue Jan 9, 2019
Improvements, fixes and reviews!

Fix:

- of the way we handle and work with the database when we are used
as an imported module.
- of the way we generate files when retesting elements which are
in the database.
    - Indeed, the last release did not provide a full implementation,
    as we stopped generating files even if the tested element was
    still on the list we are testing.
- of `--no-files`|`no_files` rules which were not respected when
`--json`|`generate_json` were used/activated.

Review:

- of the path/location of the `iana-domains-db.json` file.
- of the way we handle empty file or list to test.

Improvement:

- of the way we initiate/save/call global information which is not related
to the CONFIGURATION.
    - Indeed, we were putting everything in the CONFIGURATION variable.
    Now there is CONFIGURATION for the configuration and INTERN for
    everything else.

Introduction:

- of more scenarios for CI tests.

Contributors:

- @dnmTX (#16)
- @jawz101 (#18)

@dnmTX

dnmTX commented Jan 10, 2019

@funilrys for some reason in smed79_propellerads_adservers there is no clean.list. Could be a bug.

@dnmTX

dnmTX commented Jan 10, 2019

@smed79 I've been meaning to ask you: which countries/locations are your ad lists (admeasures_adservers, getadmiral.com and so on) most relevant to? I'm trying to decide which ones to use, but I really don't need the "extra weight", if you know what I mean. Anything North America would be my first choice.
@funilrys sorry, I know it's not really relevant to the subject(s) here, but since it's all getting mixed up in this thread anyway, I might as well ask away.

@smed79
Contributor

smed79 commented Jan 10, 2019

I agree with you, too many lists is confusing... 😕

They are relevant for ALL countries/locations especially for those users who visit streaming, torrent or adult sites. The malvertising ad networks that you have mentioned are using rotating domains trying to escape ad blockers. For that reason strict blocking is applied by EasyList for some sites (e.g. #p130918).

The getadmiral list has the purpose of blocking the anti-adblock wall https://vgy.me/1mqjc4.jpg
Some sites where it is used: https://ghostbin.com/paste/uzocj/raw

PS:

  • I am planning to maintain another list which will block ad-maven.com revolving adservers.
  • When I have some free time, I am going to merge the mentioned lists into one full list.

@dnmTX

dnmTX commented Jan 10, 2019

@smed79 thank you for that informative answer. Looks like at one point or another they are all relevant to me (streaming, torrent or adult sites), so I guess I'll load them all.
Thank you for all the lists you provide and for the great job maintaining them 👍

When I have some free time, I am going to merge the mentioned lists into one full list.

That would be great. I guess I'd better wait till then.

@dnmTX

dnmTX commented Jan 11, 2019

Yep, looks like we are on track here. @anudeepND check out the newcomers in your INVALID folder. 🙂

@funilrys are any of the latest changes/fixes applied to the Dead Hosts repos?
I was monitoring lightswitch05 and it was stuck on filtering for two days.
Wondering if I have to move there and start... "inspecting" 😋

@funilrys
Owner Author

@dnmTX: @dead-hosts uses the dev version by default, but members/maintainers can switch to the stable one if they want 😄

Sooo, yes, any changes here are there too 😉

@dead-hosts is actually the first place to use PyFunceble... You don't even have to think about how to use PyFunceble if your lists are tested at @dead-hosts 😄

@dnmTX

dnmTX commented Jan 11, 2019

@dnmTX: @dead-hosts uses the dev version by default, but members/maintainers can switch to the stable one if they want 😄

PING @lightswitch05 !!!!
