Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated TZ test list #520

Closed
wants to merge 6 commits into from
Closed

Updated TZ test list #520

wants to merge 6 commits into from

Conversation

@agrabeli
Copy link
Collaborator

agrabeli commented Sep 30, 2019

This PR includes:

  • Clean up of TZ test list => I manually checked all of the URLs and deleted the ones that were squatted, re-purposed, or otherwise no longer operational.
  • Updated URLs
  • Fixed category codes
  • Added new URLs based on desktop research
agrabeli added 2 commits Sep 30, 2019
@anadahz

This comment has been minimized.

Copy link
Contributor

anadahz commented Oct 1, 2019

@agrabeli Please have a look at #496 as it contains a lot of information and URLs specific to the Tanzanian test-list.

@agrabeli agrabeli mentioned this pull request Oct 5, 2019
@agrabeli

This comment has been minimized.

Copy link
Collaborator Author

agrabeli commented Oct 5, 2019

@anadahz thanks for the heads-up! I wasn't aware of Sylvia's PR (particularly since it had been merged and the latest version of the TZ test list suggested that it hadn't been updated since 2014). It seems that @hellais has reverted the commit and pushed to master (as you requested), and he's asking that you please re-open a PR with Sylvia's updates.

@agrabeli

This comment has been minimized.

Copy link
Collaborator Author

agrabeli commented Oct 5, 2019

@anadahz In my latest commit (820a9e9) I have added all of @Sylviakanari 's updates based on PR #496 and your feedback (i.e. I have addressed your feedback in the commit).

@anadahz

This comment has been minimized.

Copy link
Contributor

anadahz commented Oct 6, 2019

@agrabeli Thank you for working on this PR!

There are still some duplicate domains and other entries that are already included in the global test-list. Please find below the list of duplicates:

@agrabeli

This comment has been minimized.

Copy link
Collaborator Author

agrabeli commented Oct 6, 2019

@anadahz since this PR passes the Travis CI check, that probably means that some of the duplicates you are referring to are not duplicates in terms of URLs (but duplicates of domains). The Travis script automatically detects the duplicates (between local & global test lists) and flags them (and I have made deletions based on that).

@anadahz

This comment has been minimized.

Copy link
Contributor

anadahz commented Oct 7, 2019

@agrabeli There are quite some duplicates URLs as well as URLs that are already included in the global test-list

For instance the following are the exact same URLs only difference is with HTTP or HTTPS:

https://www.thecitizen.co.tz/,NEWS,News Media,2014-04-15,citizenlab,
and
http://thecitizen.co.tz/,NEWS,News Media,2019-07-18,SKM,

As a good measure and to save volunteers bandwidth we should ensure that the test-lists are having the minimum possible URLs/domains except if there is a very specific reason to include same/duplicate entries/domains/URLs. Given that the Tanzanian test-list has 256 entries it may not be a bad idea to remove some (from these same/duplicate) entries, after all most OONI Probe versions are testing both the HTTP and the HTTPS (if available) versions of the entries.

@agrabeli

This comment has been minimized.

Copy link
Collaborator Author

agrabeli commented Oct 11, 2019

Thank you @anadahz for the feedback. In my latest commit, I deleted URLs based on the duplicates you pointed out (excluding a few URLs that I think are probably worth testing, even though their domains are included in the global list). If you identify any other duplicates, perhaps open a PR proposing their deletion once this PR has been merged (since it seems you can automatically detect them with your script)?

You raise an interesting point. The Travis CI checks are configured to check for duplicate URLs (within the test lists & between the global and country-specific test lists), but not for duplicate domains (presumably your script checks for duplicate domains?).

On the one hand, we probably don't want to flag all duplicate domains for deletion, as there are cases where we may want to check different HTTP versions of sites (to check the blocking of different web-pages). But on the other hand, we can end up with the problem (as emerged, in fact, in this test list PR) where both the HTTP and HTTPS version of a single domain is added (whether in the same test list, or in both the global and country-specific test list), which is redundant (given that OONI Probe tests for both HTTP and HTTPS anyway, where possible).

@hellais @sneft Any thoughts on how to solve this problem? Perhaps we can add new rules to the Travis script?

@anadahz

This comment has been minimized.

Copy link
Contributor

anadahz commented Oct 14, 2019

Thank you @anadahz for the feedback. In my latest commit, I deleted URLs based on the duplicates you pointed out (excluding a few URLs that I think are probably worth testing, even though their domains are included in the global list). If you identify any other duplicates, perhaps open a PR proposing their deletion once this PR has been merged (since it seems you can automatically detect them with your script)?

Thank you for working on this @agrabeli .

I have listed all the duplicates in #520 (comment) .
I don't have a script, this is manual process that (usually) ensures that are no mistakes. I manually checked the test-list using the commands grep, cut and sort and then inspecting the differences, then double checked and by using the search function of the browser (shortcut: ctrl -f) I could get the line numbers of the test-list in pull requests and pasted them in #520 (comment)

Manually inspection is one of the best processes to eliminate most human errors in test-lists.

You raise an interesting point. The Travis CI checks are configured to check for duplicate URLs (within the test lists & between the global and country-specific test lists), but not for duplicate domains (presumably your script checks for duplicate domains?).

I guess this may trigger a lot of errors but I may be wrong and a regular expression may find all/most duplicate entries. Neverthelless a good test-list review should happen during the creating, addition and merging of entries in the URL list. Again I don't have any script and I can't think of such a script that may be useful in this case.

On the one hand, we probably don't want to flag all duplicate domains for deletion, as there are cases where we may want to check different HTTP versions of sites (to check the blocking of different web-pages). But on the other hand, we can end up with the problem (as emerged, in fact, in this test list PR) where both the HTTP and HTTPS version of a single domain is added (whether in the same test list, or in both the global and country-specific test list), which is redundant (given that OONI Probe tests for both HTTP and HTTPS anyway, where possible).

IMHO I 'll prefer as less same URL/domains entries as possible in the test-lists. Of course this depends on the importance of specific URL entries. Currently the TZ test-list has more than 200 entries, so perhaps it may be more useful to keep only unique domains in order to reduce the size of the test-list.

@agrabeli It seems that you have accidentally committed the UG test-list here. I think it's better if you could add it on a separate pull request.

@agrabeli agrabeli mentioned this pull request Oct 22, 2019
@agrabeli

This comment has been minimized.

Copy link
Collaborator Author

agrabeli commented Oct 22, 2019

@anadahz I have opened a separate PR for the UG test list update (and removed it from here): #534

@hellais

This comment has been minimized.

Copy link
Collaborator

hellais commented Jan 8, 2020

I rebased and merged this into master.

@hellais hellais closed this Jan 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.