Cassandra backup/restore #29

Open
jazzl0ver opened this issue Feb 8, 2018 · 9 comments
Comments

@jazzl0ver
Collaborator

jazzl0ver commented Feb 8, 2018

I'd just like to discuss ideas on the best way to get this implemented. I thought about integrating Netflix's Priam, but it doesn't seem to work as a backup/restore-only solution.
Another cool tool is https://github.com/pearsontechnology/cassandra_snap. However, it needs SSH access to each instance and requires listing all the nodes to back up, rather than figuring them out automatically.
What are your thoughts?

@JuniusLuo
Contributor

Both Priam and cassandra_snap leverage Cassandra snapshots. See Priam Backup.
We may be able to run Priam as a sidecar container alongside the service container. The sidecar container could access the same volume as the service container. AWS ECS supports multiple containers in one TaskDefinition, and a Kubernetes Pod can also run multiple containers, so the sidecar approach would work for both AWS ECS and Kubernetes. Docker Swarm, however, does not currently have this ability. Some prototyping will be required to see whether Priam works as a sidecar container. Of course, developing our own sidecar container is another option.
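For ECS, a rough sketch of what registering such a TaskDefinition could look like, with the service container and a backup sidecar sharing one data volume (image names, paths, and sizes below are just placeholders, not the actual FireCamp definitions):

```python
# Sketch only: an ECS TaskDefinition with the Cassandra container plus a backup
# sidecar mounting the same data volume. All names/images/paths are placeholders.
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="firecamp-cassandra",
    volumes=[{"name": "cassandra-data", "host": {"sourcePath": "/mnt/cassandra"}}],
    containerDefinitions=[
        {
            "name": "cassandra",
            "image": "cloudstax/firecamp-cassandra:latest",  # placeholder image
            "memory": 4096,
            "essential": True,
            "mountPoints": [
                {"sourceVolume": "cassandra-data", "containerPath": "/var/lib/cassandra"}
            ],
        },
        {
            "name": "backup-sidecar",
            "image": "example/cassandra-backup-sidecar:latest",  # hypothetical image
            "memory": 512,
            "essential": False,
            "mountPoints": [
                {"sourceVolume": "cassandra-data",
                 "containerPath": "/var/lib/cassandra",
                 "readOnly": True}
            ],
        },
    ],
)
```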

We could also consider leveraging EBS snapshots. FireCamp Cassandra enables remote JMX, so we could run one or more job containers that use nodetool to connect to the Cassandra containers and flush the memtables to disk. After the flush finishes for all replicas, another job container (or containers) could take snapshots of the EBS volumes. Like snapshotting every replica directly, this is only eventually consistent across replicas, and relies on Cassandra's built-in consistency mechanisms to restore consistency from the restored snapshots.
AWS EBS Snapshots are incremental backups, which means that only the blocks on the device that have changed after your most recent snapshot are saved.
The restore would take only 3 steps: 1) stop all containers, 2) restore the EBS volumes from the snapshots, 3) start the containers.
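
A minimal sketch of such a flush-then-snapshot job, assuming nodetool and boto3 are available in the job container (the hostnames, JMX port, and volume IDs below are placeholders):

```python
# Sketch of the flush + EBS-snapshot job. Hosts and volume IDs are placeholders.
import subprocess
import boto3

CASSANDRA_HOSTS = ["cass-0.example.internal", "cass-1.example.internal"]  # placeholders
JMX_PORT = "7199"
VOLUME_IDS = ["vol-0123456789abcdef0", "vol-0fedcba9876543210"]  # placeholders

ec2 = boto3.client("ec2")

# 1) Flush memtables to SSTables on every replica via remote JMX.
for host in CASSANDRA_HOSTS:
    subprocess.run(["nodetool", "-h", host, "-p", JMX_PORT, "flush"], check=True)

# 2) Take an (incremental) EBS snapshot of each data volume, tagged for later lookup.
snapshot_ids = []
for vol in VOLUME_IDS:
    resp = ec2.create_snapshot(
        VolumeId=vol,
        Description="firecamp cassandra backup",
        TagSpecifications=[{
            "ResourceType": "snapshot",
            "Tags": [{"Key": "firecamp-service", "Value": "cassandra"}],
        }],
    )
    snapshot_ids.append(resp["SnapshotId"])

print("created snapshots:", snapshot_ids)
```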

@jazzl0ver
Collaborator Author

Thanks for sharing your thoughts!

I might be wrong, but according to Netflix/Priam#649, Priam can't be used just as a backup solution.

What do you think of having a cronjob on each C* node which would launch nodetool snapshot followed by aws ec2 create-snapshot and aws ec2 delete-snapshot (for old snapshots) on that node's volumes? The job could be created/altered/deleted by a command to firecamp-manageserver. Besides the backup time, we might set snapshot tags (with, for example, the node information), retention time, an email or SNS topic for alerts in case of issues, etc.
It would be great to automate the restore via a command to the manage server as well, given the time of an available backup. A list command could display the available recovery points (based on snapshot tags and datetime).
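
Something along these lines could cover the retention and recovery-point listing parts, assuming the snapshots carry a service tag as above (the tag key and retention value are placeholders):

```python
# Sketch of retention pruning and listing recovery points by snapshot tag.
import datetime
import boto3

RETENTION_DAYS = 7  # placeholder retention window
ec2 = boto3.client("ec2")

snaps = ec2.describe_snapshots(
    OwnerIds=["self"],
    Filters=[{"Name": "tag:firecamp-service", "Values": ["cassandra"]}],
)["Snapshots"]

# List the available recovery points, newest first.
for s in sorted(snaps, key=lambda s: s["StartTime"], reverse=True):
    print(s["SnapshotId"], s["VolumeId"], s["StartTime"].isoformat())

# Delete snapshots older than the retention window.
cutoff = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=RETENTION_DAYS)
for s in snaps:
    if s["StartTime"] < cutoff:
        ec2.delete_snapshot(SnapshotId=s["SnapshotId"])
```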

Having this implemented would also simplify launching a new C* cluster from an existing backup.

@JuniusLuo
Contributor

Yes, Priam is more than backup/recovery. I didn't check the detailed design/implementation, but as you posted, it might not be possible to use only the backup function.

The cronjob may not be the best option. The nodes in one cluster may run multiple services, and different services will have different backup requirements, so the cronjob would end up handling all services. It would be better to use a job container, which could be triggered on demand. The job container could launch nodetool flush and then call the AWS API to create the EBS snapshots. Every service could have its own job container.
We should use nodetool flush instead of nodetool snapshot. nodetool snapshot creates hardlinks to the SSTables; if you don't delete the hardlinks yourself, the SSTables will never be deleted and the disk will fill up.
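For reference, if nodetool snapshot were used anyway, the hardlinks would have to be cleared explicitly after copying the files off the node, roughly like this (the host and snapshot tag are placeholders):

```python
# If nodetool snapshot is used, the snapshot hardlinks need explicit cleanup
# afterwards, or the disk fills up. Host and tag below are placeholders.
import subprocess

host, tag = "cass-0.example.internal", "backup-20180208"

subprocess.run(["nodetool", "-h", host, "snapshot", "-t", tag], check=True)
# ... copy the snapshot files off the node here ...
subprocess.run(["nodetool", "-h", host, "clearsnapshot", "-t", tag], check=True)
```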

Yes, we could automate the restore. A list command will be part of the general data management framework.

We will evaluate other services as well for the general data management framework design.

@jazzl0ver
Collaborator Author

By C* node I meant a container where the C* daemon is running, not an EC2 instance. Sorry for the confusion.
And, yes, of course flush, not snapshot. My bad.

Looks like a separate container (within the same task) is indeed better than a cronjob from the standpoint of managing backups for different services: each service might have a backup job container with its own logic.

@jazzl0ver
Collaborator Author

@nagaraj07

nagaraj07 commented May 31, 2019 via email

@jazzl0ver
Collaborator Author

I don't use MongoDB, so no luck here. Regarding Kafka: why do you need to back it up?

@nagaraj07

nagaraj07 commented Jun 3, 2019 via email

@jazzl0ver
Collaborator Author

I'm not part of the CloudStax team, so I can't spend time on services we don't use. Sorry about that.
