Here we discuss various optimisation and troubleshooting error scenarios for Spark applications. We focus on community resources that help you write better ETL pipelines. If you have an interesting approach that helped improve performance, feel free to head to the live demo site and post it for others' benefit.
```shell
git clone https://github.com/chintanagrawal97/sparkLogDebugger.git
cd sparkLogDebugger
python3 -m venv spark
source spark/bin/activate
pip install -r requirements.txt
python run.py
```

After starting the app, create a profile, then go to Spark Logs --> feed in your applicationId and ClusterId details.
Refer to the screenshots.
The application expects the container log files to be present as follows:

```
<User_provided_Log_Path>/<Cluster_Id>/containers/<application_id>/..
```

Refer to code line 464 in SparScript.py:

```python
data = getListOfFiles(LOGPATH + '/' + CLUSTER_ID + '/containers/' + APPLICATION_ID)
```

Planned improvements:
- Decouple the frontend and the backend.
- Incorporate other applications such as Hive and Presto.
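To illustrate the directory convention above, here is a minimal sketch of how a helper like `getListOfFiles` could recursively collect every log file under the container path. This is an assumption about its behaviour, not the actual implementation from SparScript.py, and the `LOGPATH`, `CLUSTER_ID`, and `APPLICATION_ID` values below are hypothetical placeholders.

```python
import os

def get_list_of_files(base_dir):
    """Recursively collect all file paths under base_dir.

    Sketch of what getListOfFiles is assumed to do: walk the directory
    tree and return a flat list of file paths. The real implementation
    in SparScript.py may differ.
    """
    all_files = []
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            all_files.append(os.path.join(root, name))
    return all_files

# Build the container-log path the same way the app does
# (all three values here are hypothetical examples):
LOGPATH = "/tmp/spark-logs"
CLUSTER_ID = "j-ABC123"
APPLICATION_ID = "application_0001"

data = get_list_of_files(LOGPATH + '/' + CLUSTER_ID + '/containers/' + APPLICATION_ID)
```

If the path does not exist, `os.walk` simply yields nothing and the function returns an empty list, so a wrong ClusterId or applicationId shows up as zero container files rather than an exception.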