Reorganizes the content and media to make chapter insertion easier.
Fixes #28
Frédéric De Villamil committed Sep 24, 2018
1 parent 1d54702 commit 0a4c1b3
Showing 68 changed files with 167 additions and 158 deletions.
@@ -89,7 +89,7 @@ Tribe nodes connect to multiple Elasticsearch clusters and perform operations s

### A Minimal, Fault Tolerant Elasticsearch Cluster

![A Minimal Elasticsearch cluster](images/001-getting-started/image1.svg)
![A Minimal Elasticsearch cluster](images/image1.svg)

A minimal fault tolerant Elasticsearch cluster should be composed of:

@@ -111,7 +111,7 @@ Lucene is the name of the search engine that powers Elasticsearch. It is an open

A `shard` is made of one or multiple `segments`, which are binary files where Lucene indexes the stored documents.

![Inside an Elasticsearch index](images/001-getting-started/image2.svg)
![Inside an Elasticsearch index](images/image2.svg)

If you're familiar with relational databases such as MySQL, then an `index` is a database, the `mapping` is the database schema, and the shards represent the database data. Due to the distributed nature of Elasticsearch, and the specificities of Lucene, the comparison with a relational database stops here.
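
If you want to poke at this structure on a live cluster, the cat APIs expose it directly. A quick sketch, assuming a node listening on `localhost:9200` and a hypothetical index called `my_index`:

```bash
# List indices with their primary shard and replica counts
curl -s 'localhost:9200/_cat/indices?v'

# Show how the shards of one index are spread across the data nodes
curl -s 'localhost:9200/_cat/shards/my_index?v'

# Dump the mapping, the closest thing to a "schema" an index has
curl -s 'localhost:9200/my_index/_mapping?pretty'
```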

@@ -137,7 +137,7 @@ TODO [issue #10](https://github.com/fdv/running-elasticsearch-fun-profit/issues/

## Elasticsearch Configuration

TODO [issue #10](https://github.com/fdv/running-elasticsearch-fun-profit/issues/0)
TODO [issue #10](https://github.com/fdv/running-elasticsearch-fun-profit/issues/10)

## Elasticsearch Plugins

File renamed without changes
File renamed without changes.
File renamed without changes
File renamed without changes.
File renamed without changes.
@@ -53,7 +53,7 @@ As many data nodes as you need, split evenly between both main locations.

Architecture of a fault tolerant Elasticsearch cluster

![Architecture of a fault tolerant Elasticsearch cluster](images/003-cluster-design/image1.svg)
![Architecture of a fault tolerant Elasticsearch cluster](images/image1.svg)

Elasticsearch design for failure

@@ -83,7 +83,7 @@ Lucene is the name of the search engine that powers Elasticsearch. It is an open

Each Elasticsearch index is divided into shards. Shards are both a logical and a physical division of an index. Each Elasticsearch shard is a Lucene index. The maximum number of documents you can have in a Lucene index is 2,147,483,519. The Lucene index is divided into smaller files called segments. A segment is a small Lucene index. Lucene searches all segments sequentially.

![Inside an Elasticsearch index](images/003-cluster-design/image2.svg)
![Inside an Elasticsearch index](images/image2.svg)

Lucene creates a segment when a new writer is opened, and when a writer commits or is closed. This means segments are immutable. When you add new documents into your Elasticsearch index, Lucene creates a new segment and writes it. Lucene can also create more segments when the indexing throughput is high.
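
You can watch those segments with the cat API, and merge them down on indices that no longer receive writes. A sketch, assuming Elasticsearch 2.x or later and a hypothetical index named `my_index`:

```bash
# List every Lucene segment backing the index, with its size and document count
curl -s 'localhost:9200/_cat/segments/my_index?v'

# On an index that is no longer written to, merge segments down to one per
# shard so Lucene has fewer files to search sequentially
curl -s -XPOST 'localhost:9200/my_index/_forcemerge?max_num_segments=1'
```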

@@ -358,11 +358,11 @@ The state of your thread pools, especially the state of rejected threads. And ra

The number of searches / writes per second

![Monitoring Elasticsearch request rate](images/003-cluster-design/image4.png)
![Monitoring Elasticsearch request rate](images/image4.png)

The average time taken by your queries

![Monitoring Elasticsearch search latency](images/003-cluster-design/image5.png)
![Monitoring Elasticsearch search latency](images/image5.png)
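
If you don't have a monitoring stack collecting these graphs yet, the indices stats API exposes the raw counters behind both of them. The counters are cumulative, so a rate is the delta between two samples; a rough sketch:

```bash
# Cumulative search and indexing counters, per index and summed under _all
curl -s 'localhost:9200/_stats/search,indexing?pretty'

# Useful fields: indexing.index_total, indexing.index_time_in_millis,
# search.query_total, search.query_time_in_millis. Sample them twice and
# divide the deltas to get requests per second and average latency.
```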

A good way to know which queries take the most time is to use the Elasticsearch slow query logs.
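
The slow logs are configured per index; the thresholds below are just illustrative values set on a hypothetical index:

```bash
# Log the query phase above 1s at WARN and above 500ms at INFO, and do the
# same for the fetch phase (tune the thresholds to your own latency targets)
curl -s -XPUT 'localhost:9200/my_index/_settings' -H 'Content-Type: application/json' -d '{
  "index.search.slowlog.threshold.query.warn": "1s",
  "index.search.slowlog.threshold.query.info": "500ms",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.search.slowlog.threshold.fetch.info": "500ms"
}'
```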

File renamed without changes
File renamed without changes.
File renamed without changes
File renamed without changes.
File renamed without changes
File renamed without changes
@@ -63,7 +63,7 @@ index:
If you start combining event analysis with alerting, or if you need your events to be searchable in realtime without downtime, then things get a bit more expensive. For example, you might want to correlate your whole platform's auth.log to look for intrusion attempts or port scanning, so you can deploy new firewall rules accordingly. Then you'll have to start with a 3-node cluster. Three nodes is the minimum since you need 2 active master-eligible nodes to avoid a split brain.
![](images/004-design-event-logging/image6.svg)
![](images/image6.svg)
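
With 3 master-eligible nodes, the quorum is 2. In the Zen discovery era this book covers (pre-7.x), that quorum is enforced through `discovery.zen.minimum_master_nodes`, either in `elasticsearch.yml` or dynamically, as in this sketch:

```bash
# Require 2 master-eligible nodes to be visible before a master can be
# elected, which is what prevents a split brain on a 3-node cluster
curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "persistent": {
    "discovery.zen.minimum_master_nodes": 2
  }
}'
```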
Here, the minimal host configuration for the master / http node is:
File renamed without changes
File renamed without changes.
File renamed without changes.
@@ -20,21 +20,21 @@ Each part extensively covers the critical things to have a look at, and gives y

Elastic provides an extensive monitoring system through the X-Pack plugin. X-Pack has a free license with some functional limitations. The free license only lets you manage a single cluster and a limited number of nodes, and offers limited data retention. X-Pack documentation is available at [https://www.elastic.co/guide/en/x-pack/index.html](https://www.elastic.co/guide/en/x-pack/index.html)

![](images/006-monitoring-es/image7.png)
![](images/image7.png)

I have released 3 Grafana dashboards to monitor Elasticsearch clusters using the data pushed by the X-Pack monitoring plugin. They provide much more information than the X-Pack monitoring interface, and are meant to be used when you need to gather data from various sources. They are not meant to replace X-Pack since they don't provide the security, alerting, or machine learning features.

Monitoring at the cluster level: [https://grafana.com/dashboards/3592](https://grafana.com/dashboards/3592)

![](images/006-monitoring-es/image8.png)
![](images/image8.png)

Monitoring at the node level: [https://grafana.com/dashboards/3592](https://grafana.com/dashboards/3592)

![](images/006-monitoring-es/image9.png)
![](images/image9.png)

Monitoring at the index level: [https://grafana.com/dashboards/3592](https://grafana.com/dashboards/3592)

![](images/006-monitoring-es/image10.png)
![](images/image10.png)

These dashboards are meant to provide a look at everything Elasticsearch sends to the monitoring node. It doesn't mean you'll actually need all of that data.
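
Under the hood, the monitoring plugin stores its documents in regular indices on the monitoring cluster, which is what the Grafana dashboards query. Index patterns and field names vary between X-Pack versions; the sketch below assumes the `.monitoring-es-*` naming:

```bash
# Fetch the most recent cluster_stats document collected by X-Pack monitoring
curl -s 'localhost:9200/.monitoring-es-*/_search?pretty' -H 'Content-Type: application/json' -d '{
  "size": 1,
  "sort": [ { "timestamp": { "order": "desc" } } ],
  "query": { "term": { "type": "cluster_stats" } }
}'
```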

File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
@@ -166,14 +166,14 @@ The migration processes were running on 8 virtual machines with 4 core and 8GB R

During the second part, we merged the data from Kafka with data from 2 other Galera clusters and an Elasticsearch cluster before pushing it into Blackhole.

![Blackhole initial migration](images/100-use-cases-reindexing-36-billion-docs/image7.svg)
![Blackhole initial migration](images/image7.svg)

## Blackhole initial migration

The merge and indexing parts took place on 8 virtual machines, each having 4 cores and 8GB RAM. Each machine was running 8 indexing processes, each reading an offset of a Kafka partition.

The indexer was shard aware. It had a mapping between the index it was writing to, its shards, and the data nodes they were hosted on. This allowed it to index directly on the right data nodes with the lowest possible network latency.
![Blackhole sharding](images/100-use-cases-reindexing-36-billion-docs/image8.svg)
![Blackhole sharding](images/image8.svg)
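
The building block for this kind of shard awareness is the `_search_shards` API, which returns, for a given index and optional routing value, the shards involved and the nodes currently hosting them. A sketch against a hypothetical index:

```bash
# Which shards back the index, and which nodes hold each copy?
curl -s 'localhost:9200/my_index/_search_shards?pretty'

# With a routing value, the answer narrows down to the single shard
# that value hashes to
curl -s 'localhost:9200/my_index/_search_shards?routing=42&pretty'
```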

This part was not as smooth as we expected.

@@ -183,7 +183,7 @@ Surprisingly, the main bottleneck was neither one of the Galera clusters nor the

The other unexpected bottleneck was the CPU: we were CPU bound, but the disks were not a problem (which is expected since we're using SSDs).

![Blackhole system load](images/100-use-cases-reindexing-36-billion-docs/image9.png)
![Blackhole system load](images/image9.png)

After 9 days, the data was fully indexed and we could start playing with the data.

@@ -209,19 +209,19 @@ Logstash has both an Elasticsearch input, for reading, an Elasticsearch output,

We ran a few tests on Blackhole to determine which configuration was the best, and ended up with 5,000-document scrolls and 10 indexing workers.

![Testing blackhole metrics](images/100-use-cases-reindexing-36-billion-docs/image10.png)
![Testing blackhole metrics](images/image10.png)

Testing with a 5,000-document scroll and 10 workers

For these tests, we were running with a production configuration, which explains the refresh and segment count madness. Indeed, running with 0 replicas was faster, but since we're using RAID0, that configuration was a no go.
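
For reference, the kind of Logstash pipeline we're talking about looks roughly like the sketch below. Hostnames and index names are made up, and the exact option set changed between Logstash versions, so treat it as an illustration rather than our production configuration.

```bash
# Scroll the old index 5000 documents at a time and reindex into the new one
# with 10 output workers
cat > reindex.conf <<'EOF'
input {
  elasticsearch {
    hosts   => ["source-node:9200"]
    index   => "blackhole-old"
    size    => 5000
    scroll  => "5m"
    docinfo => true
  }
}
output {
  elasticsearch {
    hosts       => ["target-node:9200"]
    index       => "blackhole-new"
    document_id => "%{[@metadata][_id]}"
    workers     => 10
  }
}
EOF
logstash -f reindex.conf
```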

During the operation, both source and target nodes behaved without problems, particularly at the memory level.

![Testing blackhole memory metrics](images/100-use-cases-reindexing-36-billion-docs/image11.png)
![Testing blackhole memory metrics](images/image11.png)

Source node for reindexing

![Testing blackhole node behavior](images/100-use-cases-reindexing-36-billion-docs/image12.png)
![Testing blackhole node behavior](images/image12.png)

Target node behavior

@@ -354,15 +354,15 @@ Moulinette is the processing script. It's a small daemon written in Bash (with s

Once again, the main problem was being CPU bound. As you can see on the Marvel screenshot, the cluster was put under heavy load during the whole indexing process. Considering that we were both reading and writing on the same cluster, **with an indexing rate over 90,000 documents per second and peaks at 140,000 documents per second**, this is not surprising at all.

![Reindexing blackhole, 2 days after](images/100-use-cases-reindexing-36-billion-docs/image13.png)
![Reindexing blackhole, 2 days after](images/image13.png)

Having a look at the CPU graphs, there was little we could do to improve the throughput without dropping Logstash and relying on a faster solution running on fewer nodes.

![CPU usage](images/100-use-cases-reindexing-36-billion-docs/image14.png)
![CPU usage](images/image14.png)

The disk operations clearly show the scroll / index processing. There was certainly some latency inside Logstash for the transform process, but we didn't track it.

![Disks operations](images/100-use-cases-reindexing-36-billion-docs/image15.png)
![Disks operations](images/image15.png)

The other problem was losing nodes. We had some hardware issues and lost some nodes here and there. This caused indexing from that node to crash, and indexing to that node to stall since Logstash does not exit when the output endpoint crashes.

@@ -27,11 +27,11 @@ Once the cluster was all green, I could shut down the second Canadian node, then

You may have noticed that at that time, routing nodes were still in Canada, and data in France.

![Cluster Topology](images/101-use-case-migrating-cluster-over-ocean/image2.svg)
![Cluster Topology](images/image2.svg)

That's right. The last part of it was playing with DNS.

![Changing the DNS on Amazon](images/101-use-case-migrating-cluster-over-ocean/image3.png)
![Changing the DNS on Amazon](images/image3.png)

The main ES hostname the application accesses is managed using Amazon Route53. Route53 provides a nice round robin feature so the same A record can point to many IPs or CNAMEs with a weight system. It's pretty cool even though it does not provide failover: if one of your nodes crashes, it needs to unregister itself from Route53.
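
For the record, such a weighted record set can be managed with the AWS CLI; the zone id, hostname, and IP below are made up, and you need one record set per node, each with its own `SetIdentifier` and `Weight`:

```bash
# Register one node in a weighted round-robin A record
aws route53 change-resource-record-sets --hosted-zone-id Z1234567890ABC --change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "es.example.com",
      "Type": "A",
      "SetIdentifier": "es-node-1",
      "Weight": 10,
      "TTL": 60,
      "ResourceRecords": [{ "Value": "192.0.2.10" }]
    }
  }]
}'
```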

@@ -10,7 +10,7 @@ Reindexing that cluster was easy because it was not on production yet. Reindexin

As you can see on the screenshot below, our main bottleneck the first time we reindexed the aptly named Blackhole was the CPU. Having the whole cluster at 100% with a load of 20 is not an option, so we needed to find a workaround.

![Cluster load on Marvel](images/102-use-case-advanced-architecture-high-volume-reindexing/image1.png)
![Cluster load on Marvel](images/image1.png)

This time, we won't reindex Blackhole but Blink. Blink stores the data we display in our clients' dashboards. We need to reindex it every time we change the mapping to enrich that data and add the new features our clients and colleagues love.

@@ -26,7 +26,7 @@ Blink is a group of 3 clusters built around 27 physical hosts each, having 64GB

The data nodes have 4 * 800GB SSD drives in RAID0, about 58TB per cluster. The data nodes are configured with Elasticsearch zone awareness. With 1 replica for each index, this makes sure we have 100% of the data in each data center, so we're crash proof.

![Blink Architecture](images/102-use-case-advanced-architecture-high-volume-reindexing/image2.svg)
![Blink Architecture](images/image2.svg)
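
Zone awareness itself boils down to a node attribute plus an allocation setting. A sketch, assuming each node declares a `zone` attribute in its `elasticsearch.yml` (the attribute name and the `dc1` / `dc2` values are illustrative):

```bash
# Spread each shard's copies across the "zone" attribute; forced awareness
# also stops Elasticsearch from piling all copies into the surviving zone
# when the other data center goes down
curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone",
    "cluster.routing.allocation.awareness.force.zone.values": "dc1,dc2"
  }
}'
```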

We didn't allocate the http query nodes to a specific zone for a reason: we want to use the whole cluster when possible, at the cost of 1.2ms of network latency. From [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/allocation-awareness.html):

@@ -50,7 +50,7 @@ backend be_blink01
```

So our infrastructure diagram becomes:
![Blink infrastructure with datacenter awareness](images/102-use-case-advanced-architecture-high-volume-reindexing/image3.svg)
![Blink infrastructure with datacenter awareness](images/image3.svg)

In front of the Haproxy, we have an applicative layer called Baldur. Baldur was developed by my colleague [Nicolas Bazire](https://github.com/nicbaz) to handle multiple versions of the same Elasticsearch index and route queries amongst multiple clusters.

@@ -83,7 +83,7 @@ client id / report id / active index id
So the API knows which index it should use as active.

Just like the load balancers, Baldur holds a VIP managed by another Keepalived, for failover.
![Cluster architecture with Baldur](images/102-use-case-advanced-architecture-high-volume-reindexing/image4.svg)
![Cluster architecture with Baldur](images/image4.svg)

---

@@ -225,7 +225,7 @@ node:

Here's what our infrastructure looks like now.

![The final cluster](images/102-use-case-advanced-architecture-high-volume-reindexing/image5.svg)
![The final cluster](images/image5.svg)

It's now time to reindex.

Expand Down Expand Up @@ -272,7 +272,7 @@ Here, we:
* read from the passive http query nodes. Since they're zone aware, they query the data in the same zone first
* write to the data nodes inside the indexing zone so we won't load the nodes accessed by our clients

![Reindexing with Baldur](images/102-use-case-advanced-architecture-high-volume-reindexing/image6.svg)
![Reindexing with Baldur](images/image6.svg)

Once we're done reindexing a client, we update Baldur to change the active indexes for that client. Then, we add a replica and move the freshly baked indexes into the production zones.
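
Moving a freshly baked index out of the indexing zone is only an index settings update; the index and zone names below are illustrative:

```bash
# Add the replica and retarget the allocation filter from the indexing zone
# to the production zones; Elasticsearch relocates the shards on its own
curl -s -XPUT 'localhost:9200/client_index_v2/_settings' -H 'Content-Type: application/json' -d '{
  "index.number_of_replicas": 1,
  "index.routing.allocation.include.zone": "production1,production2"
}'
```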

@@ -23,11 +23,12 @@ The average cluster stats are 85k writes / second, with 1.5M in peak, and 800 re
Blackhole is a 77-node cluster, with 200TB of storage, 4.8TB of RAM, 2.4TB of it allocated to Java, and 924 CPU cores. It is made of 3 master nodes, 6 ingest nodes, and 68 data nodes. The cluster holds 1137 indices, with 13613 primary shards and 1 replica, for 201 billion documents. It gets about 7000 new documents / second, with an average of 800 searches / second on the whole dataset.

Blackhole data nodes are spread between 2 data centers. By using rack awareness, we make sure that each data center holds 100% of the data, for high availability. Ingest nodes are rack aware as well, to leverage Elasticsearch prioritising nodes within the same rack when running a query. This allows us to minimise query latency. A Haproxy controls both the ingest nodes' health and proper load balancing amongst them.
![Blackhole rack awareness design](images/103-use-case-migrating-130tb-cluster-without-downtime/image16.svg)

![Blackhole rack awareness design](images/image16.svg)

Feeding Blackhole is a small part of a larger processing chain. After multiple enrichments and transformations, the data is pushed into a large Kafka queue. A work unit reads the Kafka queue and pushes the data into Blackhole.

![Blackhole processing chain](images/103-use-case-migrating-130tb-cluster-without-downtime/image17.svg)
![Blackhole processing chain](images/image17.svg)

This has many pros, the first one being the ability to replay a whole part of the process in case of error. The only con here is needing enough disk space for the data retention, but in 2017 disk space is not a problem anymore, even at a scale of tens of TB.

@@ -96,7 +97,7 @@ The first migration step was throwing more hardware at Blackhole.
We added 90 new servers, split across 2 data centers. Each server has a 6-core Xeon E5-1650v3 CPU, 64GB of RAM, and 2 * 1.2TB NVMe drives set up as a RAID0. These servers were set up to use a dedicated network range, as we planned to use them to replace the old Blackhole cluster and didn't want to mess with the existing IP addresses.

These servers were deployed with Debian Stretch and Elasticsearch 2.3. We had some issues since the Elasticsearch 2 systemd scripts don't work on Stretch, so we had to run the service manually. We configured Elasticsearch to use 2 new racks, Barack and Chirack. Then, we updated the replication factor to 3.
![Blackhole, expanded](images/103-use-case-migrating-130tb-cluster-without-downtime/image18.svg)
![Blackhole, expanded](images/image18.svg)

```bash
curl -XPUT "localhost:9200/*/_settings" -H 'Content-Type: application/json' -d '{
@@ -115,7 +116,7 @@ On the vanity metrics level, Blackhole had:
* 10.84TB of RAM, 5.42TB of it allocated to Java,
* 2004 cores.

![Blackhole on steroids](images/103-use-case-migrating-130tb-cluster-without-downtime/image19.png)
![Blackhole on steroids](images/image19.png)

If you're wondering why we didn't save time by only raising the replication factor to 2: do it, lose a data node, enjoy, and read the basics of distributed systems before you try to run one in production.

@@ -171,7 +172,7 @@ Thankfully, there's a trick for that.
Elasticsearch provides a concept of zones, which can be combined with rack awareness for better allocation granularity. For example, you can dedicate a lot of hardware to the freshest, most frequently accessed content, less hardware to content accessed less frequently, and even less hardware to content that is never accessed. Zones are configured at the host level.
![Zone configuration](images/103-use-case-migrating-130tb-cluster-without-downtime/image20.svg)
![Zone configuration](images/image20.svg)
We decided to create a zone that would only hold the data of the day, so the hardware would be less stressed by the migration.
@@ -234,7 +235,7 @@ curl -XPUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json
```

Then, we shut down Elasticsearch on Barack, Chirack, and one of the cluster master nodes.
![Moving from zone to zone](images/103-use-case-migrating-130tb-cluster-without-downtime/image21.svg)
![Moving from zone to zone](images/image21.svg)

Removing nodes to create a new Blackhole

@@ -283,7 +284,7 @@ The last thing was to add a work unit to feed that Blackhole02 cluster and catch

The whole migration took less than 20 hours, including transferring 130TB of data across a dual data center setup.

![The migration](images/103-use-case-migrating-130tb-cluster-without-downtime/image22.svg)
![The migration](images/image22.svg)
The most important point here was that we were able to roll back at any time, including after the migration, if something was wrong at the application level.

Deciding to double the cluster for a while was mostly a financial debate, but it had lots of pros, starting with the safety it brought, as well as replacing hardware that had been running for 2 years.