# Distributed Computing

Distributed computing is a method of computing in which multiple computers, or nodes, work together as a single system to perform a task or solve a problem. It is often used for tasks that are too large or complex for a single computer, or that require more computing power, storage, or bandwidth than one machine can provide.

In a distributed computing system, tasks are divided into smaller subtasks that are distributed among the nodes, and each node works on its assigned subtask in parallel with the other nodes. The nodes communicate with each other to coordinate their work and share data, and the results of the subtasks are combined to produce the final output.
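
The split, process, and combine pattern described above can be sketched on a single machine. This is only an illustration: worker threads stand in for nodes, and the helper names (`split_into_chunks`, `sum_of_squares`, `distributed_sum_of_squares`) are hypothetical. A real distributed system would also send the chunks over a network and handle node failures.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, num_chunks):
    """Divide a list into num_chunks contiguous, nearly equal parts."""
    chunk_size, remainder = divmod(len(items), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `remainder` chunks each take one extra item.
        end = start + chunk_size + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The subtask that each worker runs on its own chunk."""
    return sum(x * x for x in chunk)


def distributed_sum_of_squares(items, num_workers=4):
    # 1. Divide the task into subtasks.
    chunks = split_into_chunks(items, num_workers)
    # 2. Process the subtasks in parallel (threads stand in for nodes).
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(sum_of_squares, chunks))
    # 3. Combine the partial results into the final output.
    return sum(partial_results)


total = distributed_sum_of_squares(list(range(1, 101)))
print(total)  # 338350
```

The combine step works here because addition is associative, so the partial sums can be merged in any grouping. Tasks whose subtasks depend on each other need more coordination than this sketch shows.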

There are several benefits of distributed computing, including:

- **Increased processing power:** Running subtasks on multiple nodes in parallel can substantially reduce the time needed to complete a task. The achievable speedup is bounded by the portion of the work that cannot be parallelized and by the communication overhead between nodes.
- **Improved scalability:** Distributed computing systems can be scaled out or in by adding or removing nodes, which suits workloads whose demand for computing power varies.
- **Fault tolerance:** Distributed computing systems are often designed to be fault-tolerant: if one node fails, its work can be reassigned to other nodes so that the overall task still completes.
- **Lower costs:** By using multiple low-cost computers instead of a single high-cost computer, distributed computing can be a more cost-effective way to perform certain tasks.
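
How much adding nodes helps depends on how much of the task can run in parallel. A common first-order estimate is Amdahl's law, sketched below. The function name `amdahl_speedup` and the 95% parallel fraction are illustrative assumptions, and the formula ignores communication and coordination overhead, so real speedups are usually lower.

```python
def amdahl_speedup(parallel_fraction, num_nodes):
    """Estimate the speedup from Amdahl's law.

    parallel_fraction: share of the work that can run in parallel (0.0 to 1.0).
    num_nodes: number of nodes working on the parallel part (at least 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time regardless of node count;
    # only the parallel part is divided among the nodes.
    return 1.0 / (serial_fraction + parallel_fraction / num_nodes)


for nodes in (1, 8, 64, 1024):
    print(nodes, round(amdahl_speedup(0.95, nodes), 2))
```

With 95% of the work parallelizable, the estimated speedup approaches but never reaches 20x, however many nodes are added, because the remaining 5% still runs serially.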

Distributed computing is used in many different fields, including scientific computing, data processing, and web applications. Examples of distributed computing frameworks and services include Apache Hadoop, Apache Spark, and Amazon EMR (formerly Amazon Elastic MapReduce), a managed AWS service that runs frameworks such as Hadoop and Spark.

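Here's an example of using PySpark to train and evaluate a logistic regression model. The synthetic dataset is small enough for one machine and serves only to illustrate the API: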

```python
import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Generate a random dataset of 10000 samples with 10 features.
# The labels are drawn independently of the features, so there is no signal to learn.
X = np.random.rand(10000, 10)
y = np.random.randint(2, size=10000)

# Create a Spark session
spark = SparkSession.builder.appName("MyApp").getOrCreate()

# Build one DataFrame that holds each sample's features and label in the same row.
# Joining two DataFrames on monotonically_increasing_id() is unsafe: the generated
# IDs depend on partitioning and are not guaranteed to line up across DataFrames.
feature_cols = [f"f{i}" for i in range(X.shape[1])]
rows = [tuple(features) + (int(label),) for features, label in zip(X.tolist(), y)]
data = spark.createDataFrame(rows, feature_cols + ["label"])

# Combine the feature columns into a single vector column
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
data = assembler.transform(data)

# Split the data into training and testing sets
training_data, testing_data = data.randomSplit([0.7, 0.3], seed=12345)

# Train a logistic regression model on the training data
lr = LogisticRegression(maxIter=10, featuresCol="features", labelCol="label")
model = lr.fit(training_data)

# Make predictions on the testing data
predictions = model.transform(testing_data)

# Evaluate the model using the area under the ROC curve
evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)
print("AUC:", auc)

# Stop the Spark session
spark.stop()
```

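One run printed `AUC: 0.4820883370199694`; the exact value varies between runs because the data is random and unseeded. A value this close to 0.5 is expected, since the labels are generated independently of the features and the model has no signal to learn.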
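
The interpretation of AUC can be checked without Spark. The following is a minimal pure-Python sketch, not PySpark's implementation: `roc_auc` is a hypothetical helper that computes AUC as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, and both score lists are synthetic.

```python
import random


def roc_auc(labels, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
labels = [rng.randint(0, 1) for _ in range(2000)]

# Scores unrelated to the labels, like a model trained on random labels.
random_scores = [rng.random() for _ in labels]
# Scores that carry signal: the label plus Gaussian noise.
informative_scores = [label + rng.gauss(0.0, 0.5) for label in labels]

auc_random = roc_auc(labels, random_scores)
auc_informative = roc_auc(labels, informative_scores)
print("random scores:", auc_random)
print("informative scores:", auc_informative)
```

Scores that ignore the labels should land near 0.5, while scores correlated with the labels should land well above it. To get a meaningful AUC from the PySpark example, the labels would need to depend on the features.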
