This is the repository hosting the public [DocSearch](https://community.algolia.com/docsearch/) configurations.

**Update** You can check [the DocSearch dedicated documentation website](https://community.algolia.com/docsearch/documentation/docsearch-scraper/config-options/) if you need more details regarding how to fine-tune your configuration.

DocSearch is composed of 3 different projects:
* The front-end JavaScript library: https://github.com/algolia/docsearch
* The scraper which browses & indexes web pages: https://github.com/algolia/docsearch-scraper
Once you run the DocSearch scraper on a specific configuration, it will:
- extract the content of every single crawled page following the logic you defined using the CSS selectors
- push the resulting records to the Algolia index you configured

## Configuration format

A configuration file looks like:

```json
{
  "index_name": "stripe",
  "start_urls": [
    "https://stripe.com/docs"
  ],
  "stop_urls": [
    "https://stripe.com/docs/api"
  ],
  "selectors": {
    "lvl0": "#content header h1",
    "lvl1": "#content article h1",
    "lvl2": "#content section h3",
    "lvl3": "#content section h4",
    "lvl4": "#content section h5",
    "lvl5": "#content section h6",
    "text": "#content header p,#content section p,#content section ol"
  },
  "selectors_exclude": [
    ".method-list",
    "aside.note"
  ],
  // additional options
  [...]
}
```

### `index_name` ***Mandatory***

Name of the Algolia index where all the data will be pushed. If the `PREFIX` environment variable is defined, the index name will be prefixed with it.

*This name must be equal to the configuration file name.*

### `start_urls` ***Mandatory***
You can pass either a string or an array of URLs. The crawler will visit each page in order, following every link it finds on the page. It will only stop if the domain is outside of `allowed_domains` or if the link is blacklisted in `stop_urls`. Strings are treated as regular expressions.

Note that it currently does not follow 301 redirects.
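
Since strings are treated as regular expressions, a single entry can match a whole family of pages. A minimal sketch (the `example.com` URLs are hypothetical placeholders):

```json
// example.com is a placeholder; replace with your own documentation URLs
"start_urls": [
  "https://www.example.com/docs/",
  "https://www.example.com/docs/v[0-9]+/"
]
```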

### `selectors` ***Mandatory***

This object contains all the CSS selectors that will be used to create the record hierarchy. It contains six levels (`lvl0`, `lvl1`, `lvl2`, `lvl3`, `lvl4`, `lvl5`) and `text`. You should fill in at least the first three levels for better relevance.

A default config would be to target the page `title` or `h1` as `lvl0`, the `h2`
as `lvl1` and `h3` as `lvl2`. `text` is usually any `p` of text.
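
A minimal sketch of such a default mapping (the selectors below are illustrative; adapt them to your own markup):

```json
// illustrative selectors only; use the elements that structure your pages
"selectors": {
  "lvl0": "h1",
  "lvl1": "h2",
  "lvl2": "h3",
  "text": "p"
}
```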

## Sitemap crawling ***Optional***

Our crawler can also discover the URLs to crawl from sitemaps. To use this, set `sitemap_urls` to the direct URL(s) of your sitemap XML file(s), and optionally define regex(es) in `sitemap_urls_regexs` to select which sitemap URLs to crawl. If no regexes are defined, the `start_urls` patterns will be used to match the URLs found in the sitemap.

For sites that use sitemap index files pointing to other sitemap files, all the referenced sitemaps will be followed.

### `sitemap_urls`
A list of URLs pointing to the sitemaps (or sitemap index files) you want to crawl. Must be provided if you want to discover URLs through sitemaps.

### `sitemap_urls_regexs`
A list of regular expressions that will be applied to each URL found in the sitemap. If a pattern matches a URL, that link will be scraped. If no regular expressions are defined, the `start_urls` will be used as patterns.

### `force_sitemap_urls_crawling`
Specifies whether matched URLs should bypass the rules applied to crawled hyperlinks. If set to `true`, each matched URL will be scraped whether or not it matches `start_urls` or `stop_urls`. Default is `false` (disabled).

#### Example
```json
[...]
"sitemap_urls": [
  "https://www.mySite.com/sitemap.xml"
],
"sitemap_urls_regexs": [
  "/doc/"
],
"force_sitemap_urls_crawling": true,
[...]
```
Given this configuration, every page whose URL contains `/doc/` will be scraped, even if it does not match the `start_urls` or `stop_urls`.


### Global selectors

It's possible to make a selector global, which means that all records for the page will have this value. This is useful when you have a title in the right sidebar and the sidebar is placed after the content in the DOM.

```json
"selectors": {
"lvl0": {
"selector": "#content header h1",
"global": true
}
}
```

### Xpath selector

By default, a selector is treated as a [CSS selector](https://www.w3schools.com/cssref/css_selectors.asp), but you can specify that it is an [XPath query](https://www.w3schools.com/xml/xpath_intro.asp) instead. This is useful when you want to do more complex selections, like selecting the parent of a node.

```json
"selectors": {
"lvl0": {
"selector": "//li[@class=\"chapter active done\"]/../../a",
"type": "xpath"
}
}
```

### Default value

You can add a default value. If the given selector doesn't match anything on a page, the default value will be set for each record.

```json
"selectors": {
"lvl0": {
"selector": "#content article h1",
"default_value": "Documentation"
}
}
```

### Strip Chars

You can override the default `strip_chars` per level.

```json
"selectors": {
"lvl0": {
"selector": "#content article h1",
"strip_chars": " .,;:"
}
}
```

### `allowed_domains`

You can pass an array of strings. This is the whitelist of
domains the crawler will browse. If a link targets a page that is not in the
whitelist, the crawler will not follow it.

Default is the domain of the first element in `start_urls`.
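
For instance, a hypothetical configuration restricting the crawl to two domains:

```json
// example.com and docs.example.com are placeholders
"allowed_domains": [
  "example.com",
  "docs.example.com"
]
```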

### `stop_urls`

This is the blacklist of URLs on which the crawler should stop. If a link in a crawled webpage targets one of the elements in the `stop_urls` list, the crawler will not follow the link.

Note that you can use regexes as well as plain URLs.

Note: It is sometimes necessary to add pages like `http://www.example.com/index.html` to the `stop_urls` list if you set `http://www.example.com` in `start_urls`, to avoid duplicated content.
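
For example, a hypothetical configuration avoiding that duplicate could look like this (the `example.com` URLs are placeholders):

```json
// placeholder URLs; adapt to your site
"start_urls": [
  "http://www.example.com"
],
"stop_urls": [
  "http://www.example.com/index.html"
]
```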

### `selectors_exclude`

By default, the `selectors` search is applied page-wide. If there are some parts
of the page that you do not want to include (like a header, sidebar or footer),
you can add them to the `selectors_exclude` key.

### `custom_settings`

This object contains any custom Algolia settings you would like to pass to the index settings.
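
As a sketch, assuming you want to tweak a couple of standard Algolia index settings (the values below are purely illustrative):

```json
// illustrative Algolia index settings; use whatever your index needs
"custom_settings": {
  "separatorsToIndex": "+#",
  "attributesForFaceting": ["language", "version"]
}
```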

### `min_indexed_level`

Lets you define the minimum level at which you want records to be indexed. For
example, with a `min_indexed_level: 1`, you will only index records that have at
least a `lvl1` field.

This is especially useful when the documentation is split into several pages, but every page duplicates the main title (see [this issue][1]).
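
In the configuration file, skipping records that only carry a `lvl0` (typically that duplicated main title) looks like:

```json
"min_indexed_level": 1
```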

### `js_render`

The HTML code that we crawl is sometimes generated using JavaScript. In those cases, the `js_render` option must be set to `true`. It will enable our internal proxy (Selenium) to render pages before crawling them.

This parameter is optional and is set to `false` by default.

### `js_wait`

The `js_wait` parameter lets you change the default waiting time to render the
webpage with the Selenium proxy.

This parameter is optional and is set to `0` seconds by default.

### `use_anchors`

`use_anchors` needs to be set to `true` for JavaScript-based documentation sites that use the URL hash for routing. Internally, this disables the canonicalization feature that removes the hash from the URL.

This parameter is optional and is set to `false` by default.
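
A sketch combining the JavaScript-related options for a hypothetical single-page documentation site (the 2-second wait is an arbitrary illustrative value):

```json
// illustrative values for a JavaScript-rendered site using hash-based routing
"js_render": true,
"js_wait": 2,
"use_anchors": true
```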

### `strip_chars`

A list of characters to remove from the text that is indexed.

Default is `" .,;:§¶"`
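
For instance, a configuration could override the global list (the character set below is illustrative):

```json
// illustrative character set; keep only what you want stripped
"strip_chars": " .,;:()"
```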

### `scrape_start_urls`

Default is `false`

### `remove_get_params`

Default is `false`

### `strict_redirect`

Default is `false`

### `nb_hits`

Each time the configuration is run locally, this attribute is set to the number of records indexed.

This attribute is used for monitoring purposes. We keep track of its evolution in order to detect significant changes.

Default is `0`.

## Possible issues

#### Duplicated content

It can happen that the crawl produces duplicated data from your website. This is mostly because the same page was crawled several times (e.g. from different URLs). If there are URLs like `http://website.com/page` and `http://website.com/page/` (notice the trailing `/` in the second one), the scraper will consider them as different pages. This can be fixed by adding a regex to the `stop_urls` in your `config.json`:

```json
"stop_urls": [
"/$"
]
```

In this attribute, you can also list the pages you want to skip:

```json
"stop_urls": [
"http://website.com/page/"
]
```

#### Anchors

The scraper will also consider pages with anchors as different pages. Make sure you remove any hash sign from the URLs you put in the start and stop URLs:

*Bad:*

```json
"stop_urls": [
"http://website.com/page/#foo"
]
```

*Good:*

```json
"stop_urls": [
"/$"
]
```

Or:

```json
"stop_urls": [
"http://website.com/page/"
]
```