
open source amundsen neo4j backup scripts #196

Closed
javamonkey79 opened this issue Dec 6, 2019 · 24 comments
Labels
keep fresh Disables stalebot from closing an issue

Comments

@javamonkey79
Contributor

AC (acceptance criteria):

  • there will be scripts provided that allow amundsen neo4j data to be backed up (on a schedule) to cloud provider blob storage. AWS S3 makes the most sense; if others need other providers (e.g. Azure), they can provide an extension to this functionality
  • once these scripts are established, we should extend them to the k8s setup as well
@feng-tao feng-tao added the keep fresh Disables stalebot from closing an issue label Dec 7, 2019
@feng-tao
Member

cc @jinhyukchang
Hey Jin, could you help @javamonkey79? Thanks. Given it's almost the holiday season, I'm not sure we can get to it in 2019.

@javamonkey79
Contributor Author

@jinhyukchang @feng-tao update?

@javamonkey79
Contributor Author

So, the main thing I'd like to know is: is it sufficient to simply take a copy of the disk contents? Taking an actual backup through the neo4j commands requires Enterprise, AFAIK. I'm trying out disk-based snapshots now, but I'm not sure they will work. Do the Lyft folks know?

@jinhyukchang
Contributor

jinhyukchang commented Jan 22, 2020

@javamonkey79 In Lyft, we use APOC to dump the data and schema (which can be done without taking down the DB) and then upload to S3. Copying the db files could work, but I was afraid of the possibility that it might dump the files while the state is not consistent.

https://neo4j.com/developer/neo4j-apoc/

@javamonkey79
Contributor Author

@jinhyukchang super, thanks, I'll check out apoc then.

@jinhyukchang
Contributor

jinhyukchang commented Jan 23, 2020

Actually, this link would be a better one:
https://neo4j.com/docs/labs/apoc/current/export/

gabrielucelli pushed a commit to gabrielucelli/amundsen that referenced this issue Jan 28, 2020
* Implement APIs/Saga/Reducer for user 'own' and user 'read' resources
* Added bookmarks, read, own to profile page
* Refactor styles related to pagination and list items
@javamonkey79
Contributor Author

Ok, I put up a PR for this:

#281

@javamonkey79
Contributor Author

PR merged, this is done

@javamonkey79
Contributor Author

@jinhyukchang @feng-tao sadly, when I rolled this out to our prod instance, I found the files were strangely small. Thinking it through a little more, I believe the while/wait loop is checking for the file, but the file already exists in an intermediate state while the export is still running. I'll work on a fix tomorrow. I'm not sure why this didn't happen in our QA env.

@javamonkey79
Contributor Author

@jinhyukchang @feng-tao ok, I checked our runs again on prod (I left it running), and while the first few runs emitted very little/no data, the job did start working. I'm not sure why it would work after the first few runs. I'm still fairly sure it needs to block on the export call somehow, but I'm still trying to think of how to achieve that, since the export call is an async REST call. There seem to be a few choices:

  1. Check on the output of the invoked REST call, somehow. This might not be possible.
  2. Check the emitted output file length; it may be possible to count the records and match that up with the response of the REST call.
  3. Switch from the aws cli image I've been using to a neo4j image and run the commands locally in the pod instead of over REST. I ran into problems with this approach, which is why I took the approach I'm on. So, I'm hesitant to try this, but I may fall back to it.

I'll do some more testing today, to try and isolate the issue and provide a fix.

cc @samshuster
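For what it's worth, option 2's "block on the export" idea can be sketched as a polling loop that waits for the export file to stop growing before uploading. This is a hypothetical helper, not the PR's code; the function name, poll interval, and the simulated background writer are all illustrative:

```shell
# Hypothetical sketch of option 2: block until the exported file stops
# growing before uploading. The poll interval must exceed the writer's
# append interval, or two polls could see the same size mid-export.
wait_for_stable_file() {
  f="$1"
  prev=-1
  while :; do
    if [ ! -f "$f" ]; then sleep 2; continue; fi
    size=$(wc -c < "$f")
    if [ "$size" -gt 0 ] && [ "$size" -eq "$prev" ]; then break; fi
    prev=$size
    sleep 2
  done
  echo "export stable at $size bytes"
}

# Local stand-in for the async APOC export: a background writer that
# appends one line per second for three seconds.
tmp=$(mktemp)
( for i in 1 2 3; do printf 'row\n' >> "$tmp"; sleep 1; done ) &
wait_for_stable_file "$tmp"
wait
```

This only guarantees the file is quiescent, not complete, so matching the record count against the REST response (as in option 2) would still be a stronger check.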

@jinhyukchang
Contributor

> @jinhyukchang @feng-tao ok, I checked our runs again on prod (I left it running), and while the first few runs emitted very little/no data, the job did start working. I'm not sure why it would work after the first few runs. I'm still fairly sure it needs to block on the export call somehow, but I'm still trying to think of how to achieve that, since the export call is an async REST call. There seem to be a few choices:
>
>   1. Check on the output of the invoked REST call, somehow. This might not be possible.
>   2. Check the emitted output file length; it may be possible to count the records and match that up with the response of the REST call.
>   3. Switch from the aws cli image I've been using to a neo4j image and run the commands locally in the pod instead of over REST. I ran into problems with this approach, which is why I took the approach I'm on. So, I'm hesitant to try this, but I may fall back to it.
>
> I'll do some more testing today, to try and isolate the issue and provide a fix.
>
> cc @samshuster

Hi @javamonkey79,
We are not running Neo4j in a k8s environment, but is there a way to avoid the REST API and just use neo4j-shell within the Neo4j pod?

For example, this is the call we make, and it's a blocking call:

    echo "CALL apoc.export.graphml.all(${data_file}, {useTypes: true, readLabels: true});" | ( time ${NEO4J_BIN}/neo4j-shell - ) | tee -a ${BACKUP_LOG_FILE}
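One way to see why this form doesn't need a polling workaround: a shell pipeline doesn't return until its consumer exits, so the calling script naturally blocks for the duration of the export. A minimal illustration, with `sleep` plus `cat` standing in for neo4j-shell:

```shell
# The pipeline only completes once the right-hand side exits, so the
# script is blocked for the full (here, simulated) export duration.
start=$(date +%s)
echo "CALL apoc.export.graphml.all(...);" | ( sleep 2; cat ) > /dev/null
end=$(date +%s)
elapsed=$((end - start))
echo "blocked for ${elapsed}s"
```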

@javamonkey79
Contributor Author

hey @jinhyukchang thanks for the follow-up:

> Hi @javamonkey79 ,
> We are not running Neo4j in k8s environment, but is there a way not using REST API, but just use neo4j-shell within Neo4j pod?

yup, that's what I meant by option 3. To clarify a bit: it may be possible to share the bin files from one container in the neo4j pod with another, but then there may be environment variables and other setup that could cause problems. Sharing between containers in this way is not typical, in my experience. Right now, one container is the neo4j container, while the backup container is an aws-cli-based container. I think I may have to use the same image for both containers, which is probably what you're thinking.
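As a sketch of that "same image on both containers" idea (a hypothetical fragment; the container names, image tag, and backup.sh path are assumptions, not taken from the Amundsen chart), the backup sidecar could be run from the neo4j image so the shell tooling is already present:

```yaml
# Illustrative fragment only, not the actual chart.
spec:
  containers:
    - name: neo4j
      image: neo4j:3.3.0
    - name: neo4j-backup
      image: neo4j:3.3.0   # same image, so bin/neo4j-shell and cypher-shell exist
      command: ["/bin/sh", "-c", "/backup/backup.sh"]
```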

@javamonkey79
Contributor Author

hey @jinhyukchang I have a new PR for this here. I want to let it run a few times to make sure it works OK (I just rolled it out to our QA env). So (in case you are really on the ball today), please don't merge until after 12:00pm PST 2/18/20.

I have already done some basic testing on it to make sure it is good, and so far it looks right. I'll note that it took a little longer, as there was a strange issue with our cluster's neo4j PVC on QA; I think this contributed to some of the issues I saw.

@javamonkey79
Contributor Author

@jinhyukchang @feng-tao I apologize, but there is some as-yet-unknown issue with this. I am still working on it. Once I figure it out, I'll let it run for a week and then let you know when it is good again.

The problem is that the while loop waits forever because the data file is never present. I thought the problem was related to the persistent volume, but I updated it yesterday and it is still having issues.

The really odd thing is that the issue is isolated to our QA cluster. Our PROD cluster is running the backup cron job just fine.

cc @samshuster

@jinhyukchang
Contributor

No problem, @javamonkey79.
No rush here; let us know when it's ready.

@feng-tao
Member

just saw this, thanks @javamonkey79. Let us know once it is ready.

@javamonkey79
Contributor Author

@feng-tao @jinhyukchang the changes are looking good in QA so far, but let's definitely stick to the 1-week rollout.

@javamonkey79
Contributor Author

@jinhyukchang I am observing a frustrating issue with neo4j. I wonder if you have encountered it before, or if it could be related to the k8s setup. Basically:

  • the cron pod starts, installs pip and aws-cli, then makes the cypher call to export the schema/data
  • neo4j then basically locks up
  • neo4j no longer returns data
  • neo4j does, however, still respond to network requests in some limited capacity
  • the neo4j webui stays up, but has no data and will not query
  • there are no errors in any logs

The only differences I've noted between what you've mentioned and my setup:

  • I am using cypher-shell instead of neo4j-shell (I could not get neo4j-shell to work, and it seems to be deprecated anyway)
  • My invocations come from another container instead of from the neo4j container itself; I think this is the canonical approach to this sort of work.

Does your process block other queries while backups are running? What sort of cadence are you running (daily, hourly, etc.)? Have you tried cypher-shell instead? TIA.

@jinhyukchang
Contributor

@javamonkey79 Unfortunately, I haven't experienced your symptom. We perform a backup every 10 minutes and it doesn't affect performance at all, let alone block other queries.

The main difference I see is neo4j-shell vs cypher-shell. Your cypher-shell command is using the bolt protocol, which I suspect works differently from neo4j-shell.

Could you try to make neo4j-shell work? (Also, I didn't see any mention of neo4j-shell being deprecated.)
https://neo4j.com/developer/kb/using-neo4j-shell-neo4j-ce-3x/

@javamonkey79
Contributor Author

@jinhyukchang it's a little hard to find, but this is the error I encountered:

https://stackoverflow.com/q/21448081/27657

From there, you can look it up in the docs (again, hard to find):

https://neo4j.com/docs/operations-manual/3.3/configuration/ports/

> Neo4j-shell | 1337 | dbms.shell.port
>
> The neo4j-shell tool is being deprecated and it is recommended to discontinue its use. Supported tools that replace the functionality of neo4j-shell are described under Chapter 10, Tools.

@jinhyukchang
Contributor

jinhyukchang commented Feb 21, 2020

Interesting.
I was checking our config, and it's just using the default port:

    # Enable a remote shell server which Neo4j Shell clients can log in to.
    dbms.shell.enabled=true
    # The network interface IP the shell will listen on (use 0.0.0.0 for all interfaces).
    #dbms.shell.host=127.0.0.1
    # The port the shell will listen on, default is 1337.
    #dbms.shell.port=1337

Could you check your config and confirm if Neo4j is using it?

@javamonkey79
Contributor Author

@jinhyukchang I checked, and our version (which is the one used here for the community as well, btw) does not have the shell switch enabled:

| Setting | Description | Value |
| -- | -- | -- |
| `dbms.shell.enabled` | Enable a remote shell server which Neo4j Shell clients can log in to. Only applicable to `neo4j-shell`. | `false` |

I double checked, and it's not listening on 1337 either.

I suppose I could try setting the flag to true and try running through neo4j-shell again.

@javamonkey79
Contributor Author

@jinhyukchang just a quick update; I've tested the neo4j-shell approach and it has been working for a few days in our QA and DEV envs. We suspect that because cypher-shell communicates over bolt, it is causing some port conflict issue, but we're not sure. I will let this run through the weekend, and if it looks good Monday I'll roll it out to prod; after that, I think we can merge.

@javamonkey79
Contributor Author

Here is an example one-time pod to restore; I'll add this to the docs in my next PR:

    apiVersion: v1
    kind: Pod
    metadata:
      name: restore-neo4j-from-latest
    spec:
      containers:
      - name: restore-neo4j-from-latest
        image: neo4j:3.3.0
        command:
         - "/bin/sh"
         - "-c"
         - |
            apk -v --update add --no-cache --quiet curl python py-pip && pip install awscli -q
            latest_backup=$(aws s3api list-objects-v2 --bucket "$BUCKET" --query 'reverse(sort_by(Contents, &LastModified))[:1].Key' --output=text)
            aws s3 cp s3://$BUCKET/$latest_backup /tmp
            tar -xf /tmp/$latest_backup -C /
            data_file=`ls /data|grep \.data`
            schema_file=`ls /data|grep \.schema`
            ./bin/neo4j-shell -host neo4j -file /data/$schema_file
            echo "CALL apoc.import.graphml('/data/$data_file', {useTypes: true, readLabels: true});" | /var/lib/neo4j/bin/neo4j-shell -host neo4j
        env:
          - name: BUCKET
            value: my-bucket-name
        volumeMounts:
          - name: data
            mountPath: /data        
      restartPolicy: OnFailure
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: neo4j-pvc
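The latest-backup selection in the pod above relies on a JMESPath sort by LastModified. The same idea, sketched against local files (the directory and filenames here are made up for illustration), is just "newest mtime first":

```shell
# Local analogue of the s3api query
#   reverse(sort_by(Contents, &LastModified))[:1].Key
# 'ls -t' lists newest-modified first, so head -n 1 picks the latest backup.
tmpd=$(mktemp -d)
touch -t 202001010000 "$tmpd/backup-old.tar"
touch -t 202002010000 "$tmpd/backup-new.tar"
latest_backup=$(ls -t "$tmpd" | head -n 1)
echo "$latest_backup"
```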

dorianj pushed a commit to dorianj/amundsen that referenced this issue Apr 25, 2021
feng-tao pushed a commit that referenced this issue May 7, 2021
hansadriaans pushed a commit to DataChefHQ/amundsen that referenced this issue Jun 30, 2022