I am using PySpark along with Celery in a Django app. The flow of my code is as follows:

1. A POST request uploads a (large) file.
2. Django handles the request and loads the file into HDFS. PySpark then reads this large file from HDFS and loads it into Cassandra.
3. The whole upload (from reading the file to writing it into Cassandra) is handled by Celery: the task runs in the background and starts a Spark context to perform the load.

The data gets loaded into Cassandra, but the Spark context that was created by the Celery task does not stop, even after calling spark.stop() when the load is complete.
project -> celery.py
import os
from celery import Celery
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'project.settings')
app = Celery('project')
app.config_from_object('django.conf:settings', namespace='CELERY')
app.autodiscover_tasks()
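One note on the worker setup (this is an assumption about the environment, not something shown in the issue): Celery's default prefork pool forks worker processes, and a JVM launched by PySpark inside a forked child does not always terminate cleanly with the task. If the lingering context turns out to be fork-related, running the worker with a non-forking pool is a common workaround:

celery -A project worker --pool=solo --loglevel=info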
tasks.py
from django.conf import settings
from cassandra.cluster import Cluster
from pyspark.sql import SparkSession

from project.celery import app

class uploadfile():
    def __init__(self):
        self.cluster = Cluster(getattr(settings, "CASSANDRA_IP", ""))
        self.session = self.cluster.connect()

    def start_spark(self):
        self.spark = SparkSession.builder.master(settings.SPARK_MASTER) \
            .appName('Load CSV to Cassandra') \
            .config('spark.jars', self.jar_files_path) \
            .config('spark.cassandra.connection.host',
                    getattr(settings, 'SPARK_CASSANDRA_CONNECTION_HOST', '0.0.0.0')) \
            .getOrCreate()

    def spark_stop(self):
        self.spark.stop()

    def file_upload(self):
        # jar_files_path, file_from_hdfs, table_name and keyspace are
        # defined elsewhere in the original code.
        self.start_spark()
        df = self.spark.read.csv(file_from_hdfs)
        # do some operation on the dataframe
        # self.session.create_cassandra_table_if_does_not_exist
        df.write.format('org.apache.spark.sql.cassandra') \
            .option('table', table_name) \
            .option('keyspace', keyspace) \
            .mode('append').save()
        self.spark_stop()  # <<<-------------------- This does not stop the Spark context

@app.task(name="api.tasks.uploadfile")
def csv_upload():
    # handle request.FILE and upload the file to hdfs
    spark_obj = uploadfile()
    spark_obj.file_upload()
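Separately from the decorator fix (@task was never imported; the app instance's @app.task is used above), the shutdown should be unconditional: if anything between start_spark() and spark_stop() raises, the session is never stopped and the context lingers by design. A minimal sketch of file_upload with a try/finally guard, using the same (placeholder) names as above:

    def file_upload(self):
        self.start_spark()
        try:
            df = self.spark.read.csv(file_from_hdfs)
            # transformations and table creation go here
            df.write.format('org.apache.spark.sql.cassandra') \
                .option('table', table_name) \
                .option('keyspace', keyspace) \
                .mode('append').save()
        finally:
            # Always stop the session, even when the read/write fails.
            self.spark_stop()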
calling_task_script.py
from tasks import csv_upload
from rest_framework.views import APIView
from rest_framework.response import Response

class post_it(APIView):
    def post(self, request):
        csv_upload.delay()
        return Response('success')
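For completeness, the view only fires if it is wired into the URL config; a minimal sketch (the api/upload/ path and the module name are assumptions):

from django.urls import path
from calling_task_script import post_it

urlpatterns = [
    path('api/upload/', post_it.as_view()),
]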
Yes, the PySpark context and the Celery task are binding properly; that is why I am able to submit the task to PySpark and PySpark is able to load the data. The only problem is that once everything is over, the Spark context does not stop.
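If spark.stop() returns but the driver JVM stays alive inside the long-lived Celery worker process, one last-resort option is to tear down the Py4J gateway explicitly. This sketch relies on private PySpark internals (SparkContext._gateway), so it is version-dependent and not a supported API:

from pyspark import SparkContext

def hard_stop(spark):
    spark.stop()                     # stop the session/context first
    gateway = SparkContext._gateway  # private attribute, may change between versions
    if gateway is not None:
        gateway.shutdown()           # shut down the Py4J gateway so the driver JVM can exit
        SparkContext._gateway = None
        SparkContext._jvm = None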