Skip to content
This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

search index #7652

Closed
DVeris opened this issue Feb 24, 2015 · 31 comments
Closed

search index #7652

DVeris opened this issue Feb 24, 2015 · 31 comments
Assignees
Labels
Milestone

Comments

@DVeris
Copy link

DVeris commented Feb 24, 2015

@fatcrobat
Copy link

I tried that on the contao.org website, and that has to be fixed.

How to reproduce:

  1. Search for "Impressum"
  2. Count the results for Impressum
  3. Open a new tab and open the Impressum page with a random suffix like: https://contao.org/de/impressum/x12e12312c3213c123c.html.
  4. Search for "Impressum"
  5. The count for Impressum should be increased by the one we pseudo-created with step 3.

The Problem is within the checksum check in \Contao\Search::indexPage on Line 146.

// Calculate the checksum (see #4179)
$arrSet['checksum'] = md5(preg_replace('/ +/', ' ', strip_tags($strContent)));

// Return if the page is indexed and up to date
$objIndex = $objDatabase->prepare("SELECT id, checksum FROM tl_search WHERE url=? AND pid=?")->limit(1)->execute($arrSet['url'], $arrSet['pid']);

if ($objIndex->numRows && $objIndex->checksum == $arrSet['checksum'])
{
return false;
}

The checksum of the pages https://contao.org/de/impressum.html and https://contao.org/de/impressum/x12e12312c3213c123c.html is not the same, caused by relative links within the document.

Example from Custom navigation here:

<!-- indexer::stop -->
<nav class="mod_customnav block">
<a href="de/impressum/x12e12312c3213c123c.html#skipNavigation41" class="invisible">Navigation überspringen</a>
<ul class="level_1" role="menubar">
<li class="button_filled first"><a href="de/anmelden.html" title="Verwaltung des Benutzerkontos" class="button_filled first" role="menuitem">Anmelden</a></li>
<li class="button_unfilled last"><a href="https://github.com/contao/" title="Fehler melden" class="button_unfilled last" target="_blank" role="menuitem">Fehler melden</a></li>
</ul>
<a id="skipNavigation41" class="invisible">&nbsp;</a>
</nav>
<!-- indexer::continue -->

Solution: Always strip out the current url from $strContent before creating the checksum.

@leofeyer
Copy link
Member

I agree that this behavior should be changed, however I don't think that stripping the URL is the correct solution. An URI string with variables might lead to a different output and should thus be indexed separately.

@contao/developers /cc

@Toflar
Copy link
Member

Toflar commented Feb 19, 2016

Why does a random suffix not result in a 404 in the first place? These nonsense pages shouldn't even get indexed?

@leofeyer
Copy link
Member

This is because the random string is treated as auto_item:

https://contao.org/de/impressum/x12e12312c3213c123c.html

Therefore the "unused GET parameters" check does not respond with a 404 error.

@fatcrobat
Copy link

Maybe the $strContent variable should only contain content that is not between

<!-- indexer::stop -->
<!-- indexer::continue -->

comments?

@Toflar
Copy link
Member

Toflar commented Feb 19, 2016

@fatcrobat you could still have page-relative links outside of those comments which would not change anything in that case then.
@leofeyer but in that case, somebody must have called Input::get('auto_item') no? So if nobody ever called auto_item, nobody apparently handled it --> 404?

@fatcrobat
Copy link

Yes you are right, invalid requests should always return in an 404.
Yanick Witschi notifications@github.com schrieb am Fr., 19. Feb. 2016 um
20:02:

@fatcrobat https://github.com/fatcrobat you could still have
page-relative links outside of those comments which would not change
anything in that case then.
@leofeyer https://github.com/leofeyer but in that case, somebody must
have called Input::get('auto_item') no? So if nobody ever called auto_item,
nobody apparently handled it --> 404?


Reply to this email directly or view it on GitHub
#7652 (comment).

@leofeyer
Copy link
Member

I'm sure it is called somewhere in the regular program flow, like in Frontend::getPageIdFromUrl() or so.

@Toflar
Copy link
Member

Toflar commented Feb 19, 2016

It should set it back to unused then if it did not use it? I don't get why an invalid URL can be generated?

@leofeyer leofeyer added this to the 3.5.9 milestone Mar 8, 2016
@leofeyer
Copy link
Member

Let me sum this up:

The only way to "solve" this issue, is not to index URLs with a query string at all.

@Toflar
Copy link
Member

Toflar commented Mar 18, 2016

The only way to "solve" this issue, is not to index URLs with a query string at all.

Imho, that's correct. Why would you index e.g. a gallery.html?page=5? If somebody wants to index a page, it has to be a real page. By definition (at least for me, SEO friends may correct me), a real page does not have any query string. Query parameters are used for interaction/filtering stuff.

However, as long as we support ID URL's like domain.tld?id=42 at least the parameter id should be preserved. But only once.

@asaage
Copy link

asaage commented Mar 18, 2016

But those pages usually display the teaser of news or events and the words contained in the teaser might not show up in the detail/full view.
Therefore not indexing pages with query-strings like newslist.html?page=5 would make me sad..

@fritzmg
Copy link
Contributor

fritzmg commented Mar 19, 2016

@asaage Newslists are not indexed by default anyway. They have <!-- indexer::stop -->. News that do not have default defined as their source are not indexed at all currently.

Related: #6942

@Toflar
Copy link
Member

Toflar commented Mar 19, 2016

Even if you removed the indexer comments, they have never ever in the whole history of Contao/TYPOlight been indexed at all. If page or page_* is present in the URL, the page was never indexed. Same applies for id (which is pretty funny becaue that means when you don't use speaking URL's, you'll never have anything indexed. Shows that apparently nobody - for good reasons - is even using this), file, token, day, month, year and PHPSESSID. Plus whatever developers added to $GLOBALS['TL_NOINDEX_KEYS'].

@Toflar
Copy link
Member

Toflar commented Mar 19, 2016

Also, the fact that day, month and year were always excluded clearly shows that they are used for filtering the newslist. Query-Strings do not belong to the search index. Simple as that. If you want your newsteaser to be indexed, it should also be present on the detail view of the news itself and hidden using CSS or whatever. Everything else is wrong.

@asaage
Copy link

asaage commented Mar 19, 2016

agree... the teaser is added to the metadesciption on the detailpage.
but in some cases users are fine with just teaser-list's.
In those cases i removed <!-- indexer::stop --> hoping for the pages to be indexed - but i never verified this.

@leofeyer
Copy link
Member

Same applies for id (which is pretty funny becaue that means when you don't use speaking URL's, you'll never have anything indexed. Shows that apparently nobody - for good reasons - is even using this)

😄

@fritzmg
Copy link
Contributor

fritzmg commented Mar 20, 2016

Query-Strings do not belong to the search index.

This however means that things like figcaptions etc. from pages other than the first page in paginated galleries for example also cannot be found via Contao's search module as discussed in #6942. In order to capture those, the new search indexer (#242) would have to crawl URLs containing query strings as well, would it not?

@Toflar
Copy link
Member

Toflar commented Mar 20, 2016

Gallery pagination is yet another example of doing something wrong. I don't see any valid point for offering pagination for galleries. Besides the fact that it is the wrong approach from a URI point of view it is also very user unfriendly because users cannot (on mobile phones) continue to swipe but are interrupted by a page load.

Using images basically happens in two different scenarios:

a) They provide more insight/additional info to existing content.
b) The images itself, are the content. E.g. when you are a photographer and you want to show your work.

In scenario a) obviously they are placed on that existing content page (be it a news entry or a regular page) and thus indexed automatically.
In scenario b) you should have a unique URL for every image. E.g. you should have domain.tld/galleries/landscape/river.html which means they can be indexed again.

Rule of thumb: Anything that should be indexed should have a permanent URL.

@fritzmg
Copy link
Contributor

fritzmg commented Mar 20, 2016

Rule of thumb: Anything that should be indexed should have a permanent URL.

Agreed. But on that note I want to point out again that with the current news (and calendar) module's implementation, not all news entries are indexed - only news entries with source = 'default' are indexed.

Though regarding the original point about pagination - assuming that it would make sense, for whatever reason, to index paginated content. Maybe paginations in Contao should be reworked in general? With a general pagination controller or something similar with which the following conditions would be implemented:

  • only allowing one pagination element per contao page
  • creating pagination URLs like domain.tld/foo/page/2 or domain.tld/foo/page2.html instead of domain.tld/foo.html?page=2

This way the search indexer can truly omit any URL consisting of query strings - and meta tags like rel="next" and rel="prev" (#3515, #8022) (or a canonical tag with an URL that shows the whole content) could be reintroduced again. This would be useful in general, regardless of the Contao search indexer.

@Toflar
Copy link
Member

Toflar commented Mar 21, 2016

Though regarding the original point about pagination - assuming that it would make sense, for whatever reason, to index paginated content.

If anyone can provide a real use case why indexing paginated content, we can discuss that matter. For now, I cannot see any reason at all.

creating pagination URLs like domain.tld/foo/page/2 or domain.tld/foo/page2.html instead of domain.tld/foo.html?page=2

This is exactly what I want to avoid because it's completely incorrect.

@fritzmg
Copy link
Contributor

fritzmg commented Mar 21, 2016

This is exactly what I want to avoid because it's completely incorrect.

Why is it incorrect? Assuming there is a use case, where you want paginated content to be indexed.

@Toflar
Copy link
Member

Toflar commented Mar 21, 2016

We should not support something that is wrong even if the users want to do it. This use case does not exist for me.

@fritzmg
Copy link
Contributor

fritzmg commented Mar 21, 2016

I see, you meant it's "wrong" since there is no usecase (yet) that makes sense.

@Toflar
Copy link
Member

Toflar commented Mar 21, 2016

Yes. And I'm pretty much convinced that there never will be any use case where indexing paginated content makes sense. Even if it did make sense to the user, it's most likely still wrong because it violates the URI principle. Every resource should have it's own unique URI. Paged content is mixed content and does not belong to the search index :)

@fritzmg
Copy link
Contributor

fritzmg commented Mar 21, 2016

Well, one use case is to index Contao news articles or events which have a source other than default. But that would be a workaround of a different issue.

@backbone87
Copy link
Contributor

restricting search to just URLs without a QS wouldnt make much sense imho.
without diving too deep into the search code of contao, i would use the following pseudo algorithm:

function index($url, $content) {
  $tokens = tokenize($content);
  $hash = hash($tokens);
  if(!index_contains($hash)) {
    index_add($url, $tokens);
    return;
  }
  $existingURL = index_url($hash);
  if(is_better_url($url, $existingURL)) {
    index_replace($url, $existingURL);
  }
}
  • index indexes the given contents of the given url.
  • tokenize generates a token stream of the given contents: cleans out indexer::stop parts, strips html tags, strips unindexable chars, etc. (improvements like stemming could be applied here) extension point
  • hash generates a hash of the given token stream
  • index_contains checks if the given token stream hash was already indexed (under any URL)
  • index_add actually adds the tokens to the DB and associates them with the URL
  • index_url returns the URL that the given token stream hash is associated with
  • is_better_url checks if the given URL is better than the given other (existing) URL: better in a sense that it: is shorter, is normalized, etc. (here we can tweak which URL to actually index) extension point
  • index_replace replaces the given existing URL with the given URL

this would guarantee, that identical content (from a searchers PoV) is always only listed in the search results under one specific URL and that this URL is the best URL known

@backbone87
Copy link
Contributor

if paginated page should be indexed or not, can and should be decided by the user on a page by page basis with the options already present in the page edit view.

@leofeyer leofeyer modified the milestones: 3.5.9, 3.5.10 Mar 21, 2016
@leofeyer
Copy link
Member

We have debugged the issue in our Mumble call on March 24th. It is actually caused by the change language extension. If I disable it, the URL https://contao.org/de/impressum/x12e12312c3213c123c.html correctly triggers the 404 page.

@leofeyer
Copy link
Member

I have created a ticket: terminal42/contao-changelanguage#68

@leofeyer
Copy link
Member

The other issue should be fixed in 5030ccf. @contao/developers Please review carefully :)

jsonn pushed a commit to jsonn/pkgsrc that referenced this issue Apr 23, 2016
### 4.1.3 (2016-04-22)

 * Use data URIs for the image preview in the back end.
 * Use DIRECTORY_SEPARATOR to convert kernel.cache_dir into a relative path (see #464).
 * Always trigger the "isVisibleElement" hook (see contao/core#8312).
 * Do not change all sessions when switching users (see contao/core#8158).
 * Do not allow to close fieldsets with empty required fields (see contao/core#8300).
 * Make the path related properties of the File class binary-safe (see contao/core#8295).
 * Correctly validate and decode IDNA e-mail addresses (see contao/core#8306).
 * Skip forward pages entirely in the book navigation module (see contao/core#5074).
 * Do not add the X-Priority header in the Email class (see contao/core#8298).
 * Determine the search index checksum in a more reliable way (see contao/core#7652).
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

7 participants