# Batch Processing

Batch processing is a data processing method in which records are collected over a period of time and then processed together as a group, rather than individually as each record arrives.

Batch processing is useful when you have large volumes of data that do not need to be, or cannot practically be, processed in real time. It also lets you process data in parallel and make better use of resources: the data is divided into smaller chunks, the chunks are processed separately, and the results are merged.
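The divide, process, and merge pattern described above can be sketched in plain Python. This is only an illustration under stated assumptions: the `ages` list, the chunk size, and the helper names `split_into_chunks`, `process_chunk`, and `merge_results` are hypothetical, and the thread pool merely stands in for the worker processes or cluster nodes a real batch framework would use.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, chunk_size):
    """Divide a list into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def process_chunk(chunk):
    """Process one chunk independently and return a partial (sum, count)."""
    return sum(chunk), len(chunk)


def merge_results(partials):
    """Merge the partial (sum, count) pairs into one overall average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


ages = [25, 30, 20, 35, 40, 30]  # hypothetical sample data
chunks = split_into_chunks(ages, chunk_size=4)

# Each chunk is independent, so the chunks can be handed to separate workers.
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(process_chunk, chunks))

average = merge_results(partials)
print(f"Chunks: {chunks}")
print(f"Overall average: {average:.2f}")
```

Each chunk returns a sum and a count rather than its own average, because averaging the per-chunk averages would give the wrong answer when chunks differ in size.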

Batch processing is often used in data warehousing, data analysis, and data mining applications. It is commonly used for processing large datasets, such as log files, customer data, and sales data.

Batch processing can be implemented using a variety of tools and technologies. Some popular batch processing frameworks include Apache Hadoop, Apache Spark, and Apache Flink. These frameworks provide a distributed computing environment that allows you to process large volumes of data in parallel.

In Python, you can use PySpark, the Python API for Apache Spark, to perform batch processing on large datasets. With PySpark you can read data from files, databases, or other sources, process it using distributed computing, and write the results to files, databases, or other destinations.

```python
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("BatchProcessing").getOrCreate()

# Create a custom dataset
data = [("John", 25, "Male"), ("Mary", 30, "Female"), ("Bob", 20, "Male"), ("Alice", 35, "Female")]
columns = ["name", "age", "gender"]
df = spark.createDataFrame(data, columns)

# Show the dataset
print("Initial Dataset:")
df.show()

# Filter the dataset to only include males
male_df = df.filter(df.gender == "Male")

# Show the filtered dataset
print("Filtered Dataset (Males Only):")
male_df.show()

# Calculate the average age of males
average_age = male_df.selectExpr("avg(age)").collect()[0][0]

# Print the average age of males
print("Average Age of Males: {:.2f}".format(average_age))

# Stop the SparkSession
spark.stop()
```


```text
Initial Dataset:
+-----+---+------+
| name|age|gender|
+-----+---+------+
| John| 25|  Male|
| Mary| 30|Female|
|  Bob| 20|  Male|
|Alice| 35|Female|
+-----+---+------+

Filtered Dataset (Males Only):
+----+---+------+
|name|age|gender|
+----+---+------+
|John| 25|  Male|
| Bob| 20|  Male|
+----+---+------+

Average Age of Males: 22.50
```


In this example, we first create a SparkSession using `SparkSession.builder` and its `getOrCreate()` method. Then, we create a small in-memory dataset with the columns `name`, `age`, and `gender`, and display it using the `show()` method.

Next, we use the `filter()` method to keep only the rows where `gender` is `"Male"`, and display the filtered dataset with `show()`.

Then, we calculate the average age of males by passing the SQL expression `avg(age)` to the `selectExpr()` method. The `collect()` method returns the result to the driver as a list of rows. Because the aggregate produces a single row with a single column, indexing with `[0][0]` extracts the average age.

Finally, we print the average age of males, formatted to two decimal places with the `format()` method, and stop the SparkSession using the `stop()` method.

Note that this is a very simple example; batch processing can involve much more complex operations, depending on the dataset and use case.