This project analyzes a retail dataset using Apache Spark. Spark enables large-scale data analysis by processing data in parallel, making optimal use of the available threads and cores. It can therefore improve performance on a cluster as well as on a single machine. The analysis is implemented in PySpark.
PySpark is the Python API for Apache Spark, released to enable Spark and Python to work together. It lets you work with Resilient Distributed Datasets (RDDs) in Apache Spark from the Python programming language. Besides letting you write Spark applications with the Python API, it also provides the PySpark shell for interactively analyzing your data in a distributed environment. PySpark supports most of Spark's features, including Spark SQL, DataFrames, Streaming, MLlib (machine learning), and Spark Core.