Build Kafka Connector for Dgraph in Live and Bulk Loader #3967

Closed
mangalaman93 opened this issue Sep 11, 2019 · 11 comments
Labels
area/integrations: Related to integrations with other projects.
kind/feature: Something completely new we should consider.
popular
status/accepted: We accept to investigate/work on it.
status/needs-specs: Issues that require further specification before implementation.

Comments

@mangalaman93
Contributor

This will allow loading data directly from Kafka.
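For illustration, here is a minimal sketch of what the core loop of such a connector might look like, using the dgraph4j client and a plain Kafka consumer. The endpoints, topic name, and the assumption that each record value is a JSON mutation payload are all hypothetical, not part of the proposal:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import com.google.protobuf.ByteString;
import io.dgraph.DgraphClient;
import io.dgraph.DgraphGrpc;
import io.dgraph.DgraphProto.Mutation;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaToDgraph {
  public static void main(String[] args) {
    // Hypothetical endpoints and topic; adjust for your deployment.
    ManagedChannel channel =
        ManagedChannelBuilder.forAddress("localhost", 9080).usePlaintext().build();
    DgraphClient dgraph = new DgraphClient(DgraphGrpc.newStub(channel));

    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "dgraph-loader");
    props.put("enable.auto.commit", "false");
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("dgraph-mutations"));
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> record : records) {
          // Each record value is assumed to be a JSON mutation payload.
          Mutation mu = Mutation.newBuilder()
              .setSetJson(ByteString.copyFromUtf8(record.value()))
              .setCommitNow(true)
              .build();
          dgraph.newTransaction().mutate(mu);
        }
        // Commit Kafka offsets only after the mutations are applied.
        consumer.commitSync();
      }
    }
  }
}
```

A real connector would batch records and handle retries; this only shows the shape of the data flow.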

@danielmai added the kind/feature label Sep 11, 2019
@willem520

Great idea. I hope Dgraph can have close integration with processing engines (e.g. Flink, Spark) in the near future.

@campoy added the area/integrations label and removed the area/integration label Sep 13, 2019
@campoy
Contributor

campoy commented Sep 13, 2019

Hey @willem520,

Could you tell us more about what you would expect from these integrations with Flink or Spark?

@campoy added the status/needs-specs label Sep 13, 2019
@willem520

> Hey @willem520,
>
> Could you tell us more about what you would expect from these integrations with Flink or Spark?

Hello. In my project, I want to use Flink or Spark Streaming to process RDF or JSON data in real time, and I need to migrate historical data from another graph database (e.g. JanusGraph) to Dgraph. But I found that when I used Spark and dgraph4j to process a large dataset (e.g. 5 million nodes), it always failed, and sometimes an Alpha crashed.

@campoy
Contributor

campoy commented Sep 17, 2019

I'm sorry but I'm going to need more information on what you were actually building and how it failed.

If I understand correctly, you're processing a stream of events in RDF or JSON format?
Or is it a batch analysis with 5 million nodes?

What exact API would you like us to provide to integrate with Spark or Flink?

@willem520

willem520 commented Sep 19, 2019

Hi, I used Spark to load 5 million nodes into memory with 100 partitions to process the data. In each partition, I built 2,000 nodes in JSON format into one mutation and used the dgraph4j client to execute txn.mutate. When I ran the program, it failed with this error message:

[screenshot of the error message]

If I used a smaller dataset (e.g. 500,000 nodes) in the same program, it succeeded.
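For reference, the pattern described above might look roughly like the following sketch, assuming dgraph4j and the Spark Java API; the host and RDD names are hypothetical, and this is not the exact failing program:

```java
import java.util.ArrayList;
import java.util.List;

import com.google.protobuf.ByteString;
import io.dgraph.DgraphClient;
import io.dgraph.DgraphGrpc;
import io.dgraph.DgraphProto.Mutation;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import org.apache.spark.api.java.JavaRDD;

public class SparkBatchLoad {
  // "jsonNodes" is a hypothetical RDD of JSON node documents.
  static void load(JavaRDD<String> jsonNodes) {
    jsonNodes.foreachPartition(rows -> {
      // One client per partition, created on the executor.
      ManagedChannel channel = ManagedChannelBuilder
          .forAddress("alpha-host", 9080).usePlaintext().build();
      DgraphClient client = new DgraphClient(DgraphGrpc.newStub(channel));
      List<String> batch = new ArrayList<>();
      while (rows.hasNext()) {
        batch.add(rows.next());
        if (batch.size() == 2000 || !rows.hasNext()) {
          // One transaction per 2,000-node batch, as described above.
          Mutation mu = Mutation.newBuilder()
              .setSetJson(ByteString.copyFromUtf8(
                  "[" + String.join(",", batch) + "]"))
              .setCommitNow(true)
              .build();
          client.newTransaction().mutate(mu);
          batch.clear();
        }
      }
      channel.shutdown();
    });
  }
}
```

With 100 partitions running concurrently, this keeps many large transactions pending at once, which is relevant to the reply that follows.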

@mangalaman93
Contributor Author

mangalaman93 commented Sep 19, 2019

How many cores are you providing to each executor? How many executors are you running concurrently? You could try reducing the size of each transaction so that each one finishes quickly and the total number of pending transactions stays small.
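A minimal sketch of that advice, assuming dgraph4j's TxnConflictException signals an aborted transaction; the batch-size reduction happens upstream, and the retry count and backoff here are illustrative:

```java
import io.dgraph.DgraphClient;
import io.dgraph.DgraphProto.Mutation;
import io.dgraph.TxnConflictException;

public class RetryingMutator {
  // Send smaller batches and retry a few times if the transaction is
  // aborted because of a conflicting concurrent transaction.
  static void mutateWithRetry(DgraphClient client, Mutation mu) {
    for (int attempt = 1; attempt <= 3; attempt++) {
      try {
        client.newTransaction().mutate(mu);  // mu built with setCommitNow(true)
        return;
      } catch (TxnConflictException e) {
        // Back off briefly before retrying.
        try {
          Thread.sleep(100L * attempt);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
        }
      }
    }
    throw new RuntimeException("mutation failed after 3 attempts");
  }
}
```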

@willem520

willem520 commented Sep 23, 2019

I used 4 executor cores and 5 executors. I need to import at least 100 million records into Dgraph.

@Naralux

Naralux commented Oct 3, 2019

Not directly related to Dgraph, but Neo4j just announced a new product which will tightly integrate Neo4j with Kafka. I feel like this is a feature which might greatly impact DB choice for (new) projects. https://www.datanami.com/2019/10/01/neo4j-gets-hooks-into-kafka/

@marvin-hansen

@AshNL Have you ever used neo4j in your entire life?

We did, for ~3 months, and we are actually migrating everything away from it to save our sanity and our company. I cannot remember any other database that caused more operational problems, more concurrency issues, and more consistently terrible performance. The most mind-boggling thing is that the company does indeed listen to all reported problems, but they never fix anything...

Meanwhile, we run the most mission-critical stuff on Postgres. We de-normalized those few tables to operate entirely join-free to sustain very high performance.

With Dgraph, there are a few rough edges because it's relatively new, but for the most part, when it runs, it just runs.

For the aforementioned Kafka connector, there are tutorials on how to write one. I think implementing the connector with a queue and proper batch writing should do the trick; see the sketch after the link below.

https://www.confluent.fr/blog/create-dynamic-kafka-connect-source-connectors/
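Following the Kafka Connect pattern that tutorial describes, a Dgraph sink could be sketched as a SinkTask that batches each delivered set of records into one mutation. The class name, config key, and JSON-object-per-record assumption below are hypothetical:

```java
import java.util.Collection;
import java.util.Map;

import com.google.protobuf.ByteString;
import io.dgraph.DgraphClient;
import io.dgraph.DgraphGrpc;
import io.dgraph.DgraphProto.Mutation;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class DgraphSinkTask extends SinkTask {
  private ManagedChannel channel;
  private DgraphClient client;

  @Override
  public void start(Map<String, String> props) {
    channel = ManagedChannelBuilder
        .forAddress(props.getOrDefault("dgraph.host", "localhost"), 9080)
        .usePlaintext().build();
    client = new DgraphClient(DgraphGrpc.newStub(channel));
  }

  @Override
  public void put(Collection<SinkRecord> records) {
    if (records.isEmpty()) return;
    // Batch all records delivered in this call into one mutation.
    StringBuilder json = new StringBuilder("[");
    for (SinkRecord r : records) {
      if (json.length() > 1) json.append(',');
      json.append(r.value());  // assumed to be a JSON object string
    }
    json.append(']');
    Mutation mu = Mutation.newBuilder()
        .setSetJson(ByteString.copyFromUtf8(json.toString()))
        .setCommitNow(true)
        .build();
    client.newTransaction().mutate(mu);
  }

  @Override
  public void stop() {
    if (channel != null) channel.shutdown();
  }

  @Override
  public String version() {
    return "0.1-sketch";
  }
}
```

Kafka Connect then handles offset tracking and delivery retries, which is exactly the queueing and batch-writing marvin-hansen suggests.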

@Naralux

Naralux commented Jan 16, 2020

No need to start biting. I'm sorry I'm not as experienced as you are. In the meantime I have indeed written my own connector.

@shekarm added the status/accepted label Feb 18, 2020
@minhaj-shakeel
Contributor

GitHub issues have been deprecated.
This issue has been moved to discuss. You can follow the conversation there and also subscribe to updates by changing your notification preferences.

