
[Presto-integration-Technical-note] created documentation for presto integration #2568

Open · wants to merge 1 commit into master from presto-documentation
Conversation

@vandana7
No description provided.

@vandana7 force-pushed the presto-documentation branch 4 times, most recently from d4bcce8 to cb76f68 on July 27, 2018 at 07:12
@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7550/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6308/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6304/

@CarbonDataQA

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7557/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6312/

* ConnectorHandleResolver

1. **CarbonDataConnector:** Implements the Connector interface of Presto.
1. **CarbonDataMetadata:** Implements the ConnectorMetadata interface of Presto. The connector metadata interface has a large number of important methods that allow Presto to look at lists of schemas, lists of tables, lists of columns, and other metadata about a particular data source.
Contributor

It would be better to add these descriptions to the code as annotations.

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7682/

@ravipesala

SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6081/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6413/


On the other side, Apache Spark is a lightning-fast cluster computing technology designed for fast computation. It is based on Hadoop MapReduce and extends the MapReduce model to use it efficiently for more types of computations, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.

When dealing with CarbonData, both have their own advantages, but Presto is far better than Spark at executing 90% of the queries: the Presto-CarbonData vector readers are highly optimized and reduce table scan time when dealing with large tables. Even in the case of dictionary aggregation and multiple-table joins, Presto performs much better due to its own optimized way of handling these operations.
Contributor

Remove the words "far better".

Contributor Author

Done, removed the words "far better".

* Provide the link between the Functional Requirement and the detailed Technical Design documents.
* Detail the functionality which will be provided by each component or group of components and show how the various components interact in the design.

This document is not intended to address installation and configuration details of the actual implementation. Installation and configuration details are provided in the technology guides on the CarbonData wiki page. As is true with any high-level design, this document will be updated and refined based on changing requirements.
Contributor

What do you mean by "installation and configuration details of the actual implementation"?

@vandana7 (Contributor Author), Aug 3, 2018

"Installation and configuration details of the actual implementation" means that we are not providing documentation for installation and configuration of the Presto integration with CarbonData here; we have a separate document for that.

Contributor Author

To make it clearer, I have linked the installation and configuration guide for integrating CarbonData with Presto from this document. Anyone who wants to know about installation and configuration can easily visit that document page.


* #### _Scope_
Presto integration with CarbonData will allow execution of CarbonData queries on the Presto CLI. CarbonData can easily be added as a data source among the multiple heterogeneous data sources that Presto supports.
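As a minimal sketch of what adding CarbonData as a data source could look like, assuming the `carbondata` connector name and the `carbondata-store` property from the integration guide (the file path and store location below are purely illustrative):

```
# etc/catalog/carbondata.properties: registers CarbonData as a Presto catalog
connector.name=carbondata
# hypothetical CarbonData store location; adjust to your cluster
carbondata-store=hdfs://namenode:8020/user/hive/warehouse/carbon.store
```

Once the catalog is registered, tables become queryable from the Presto CLI as `carbondata.<schema>.<table>`.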
Contributor

This should be "CarbonData integration with Presto"; CarbonData is not an execution engine.

Contributor Author

done.


1. **Presto**

Integration of Presto with CarbonData includes an implementation of the Connector API of Presto.
Contributor

This should be "CarbonData with Presto".

Contributor Author

done

* Support for Apache CarbonData as a Data Source in Presto.
* Execution of Apache CarbonData Queries on Presto.

## Design Considerations
Contributor

Can we add a design diagram from Presto that talks about the integration of data sources?

Contributor Author

Done


**Performance Optimization by changing Queries:**

- There’s a probability that GROUP BY becomes a little faster if you carefully order the list of fields within GROUP BY from highest to lowest cardinality.
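For illustration, a hypothetical pair of queries where `uid` is assumed to have far higher cardinality than `gender`:

```
-- Faster: high-cardinality column listed first in GROUP BY
SELECT uid, gender, COUNT(*) FROM users GROUP BY uid, gender;

-- Slower: low-cardinality column listed first
SELECT uid, gender, COUNT(*) FROM users GROUP BY gender, uid;
```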
Contributor

Change the sentence. The word "probability" need not be used.

Contributor Author

done

- Specify large tables first in the join clause


The default join algorithm of Presto is the broadcast join, which partitions the left-hand side table of a join and sends (broadcasts) a copy of the entire right-hand side table to all of the worker nodes that hold the partitions. This works when the right-hand table is small enough to fit within one node (usually less than 2 GB). If you observe an ‘Exceeded max memory xxGB’ error, it usually means the right-hand side table is too large. Presto does not perform automatic join reordering, so make sure your large table precedes small tables in any join clause.
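For example, with hypothetical `large_table` (fact) and `small_table` (dimension) tables:

```
-- Good: the large table comes first; only small_table is broadcast
SELECT COUNT(*) FROM large_table l JOIN small_table s ON l.id = s.id;

-- Risky: the large table ends up on the right-hand side and is broadcast to every worker
SELECT COUNT(*) FROM small_table s JOIN large_table l ON s.id = l.id;
```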
Contributor

Check whether we can use wording like "fact table" for the left-hand side and "dimension table" for the right-hand side.

Contributor Author

done.



**Note:** If you still see the memory issue, try a distributed hash join. This algorithm partitions both the left and right tables using the hash values of the join keys, so it works even if the right-hand side table is large, but performance can be slower because it increases the number of network data transfers. To turn on the distributed join, embed the following session property as an SQL comment:
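As a sketch of what this could look like: Presto releases of that era exposed a `distributed_join` session property (newer releases use `join_distribution_type` instead), so the embedded comment might be:

```
-- set session distributed_join = 'true'
SELECT ... FROM large_table l, small_table s WHERE l.id = s.id
```

In a plain Presto CLI session the equivalent would be `SET SESSION distributed_join = true;`.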
Contributor

Can we specify how to use the distributed hash join?

Contributor Author

done

```
query.max-memory=210GB
```
This property value should be set according to the RAM available across all cluster worker nodes.
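A purely illustrative sizing sketch, assuming a hypothetical cluster of 5 workers with 48 GB of JVM heap each, alongside Presto's companion `query.max-memory-per-node` limit:

```
# 5 workers x 48 GB heap = 240 GB total; cap query memory below that total
query.max-memory=210GB
# per-node cap, kept below a single worker's heap
query.max-memory-per-node=42GB
```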
Contributor

Total RAM available in the cluster (sum of all nodes' RAM).

Contributor Author

Done

```
SELECT ... FROM large_table l, small_table s WHERE l.id = s.id
```

**Performance Optimization by using certain Configuration Properties:**
Contributor

Put lower values as defaults if values need to be specified. Better not to give values at all, as they will be copy-pasted directly.

Contributor Author

@sraghunandan can you please provide some more clarity on this point? I am not able to understand it; I have not provided any values, only the columns used here.

@@ -0,0 +1,27 @@
<!--
Member

Please remove this section for now, as the report is not fair.

Reasons:

  1. The Spark-CarbonData TPC-H results mentioned on the website do not match these Spark-CarbonData results; a few queries show huge differences due to a machine problem [these machines are not in the same rack].
  2. A comparison report should also include machine details [RAM, VM/bare metal]; this was not mentioned.

Contributor Author

done

@ravipesala

SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6174/

@ravipesala

SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6175/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7795/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6519/

@sraghunandan

LGTM

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/37/

@CarbonDataQA

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/8786/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/719/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/541/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1990/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2167/

@CarbonDataQA

Build Failed with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10242/

@CarbonDataQA

Build Failed with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/66/

@CarbonDataQA

Build Failed with Spark 2.3.2, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/65/

@CarbonDataQA1

Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/66/
