# Initialization

Install Docker and open cmd (Command Prompt) as administrator.

Pull the apache/spark-py image by running the following command in cmd (or search for the image in Docker's search bar and download it from there).

In [None]:
docker pull apache/spark-py

# Cluster setup and connectivity

## Open docker

In [None]:
docker run -it apache/spark-py /opt/spark/bin/pyspark

If this step does not work, you can try starting the container manually in Docker.

## Build and connect master and worker nodes

In [None]:
# Master (run in cmd on the host):
docker run -it -u root -p 8080:8080 --name spark_master apache/spark-py /bin/bash
# Inside the container: move from /opt/spark/work-dir up to /opt/spark and start the master
cd ..
./bin/spark-class org.apache.spark.deploy.master.Master

# Worker (run in a second cmd window on the host):
# To create more than one worker node, repeat these commands with a new --name.
docker run -it -u root --link spark_master:spark_master --name spark_worker apache/spark-py /bin/bash
# Inside the container: start the worker and register it with the master
cd ..
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://172.17.0.2:7077
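
The worker command above hard-codes 172.17.0.2. That is typically the address Docker's default bridge network assigns to the first container started, but it can differ if other containers are already running. Below is a minimal Python sketch, meant to run on the host, that looks up the master container's address with `docker inspect` and builds the `spark://` URL from it. The helper names `container_ip` and `build_master_url` are hypothetical, and the sketch assumes the `docker` CLI is on the PATH and the container is named `spark_master` as above.

```python
import subprocess


def build_master_url(ip, port=7077):
    """Format a standalone master URL such as spark://172.17.0.2:7077."""
    return f"spark://{ip}:{port}"


def container_ip(name):
    """Ask the docker CLI for a container's IP address (hypothetical helper)."""
    # The Go template prints the address on each network the container has
    # joined; with a single network this is exactly one IP.
    fmt = "{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}"
    result = subprocess.run(
        ["docker", "inspect", "-f", fmt, name],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    try:
        ip = container_ip("spark_master")
    except (OSError, subprocess.CalledProcessError) as exc:
        # The docker CLI is missing, or no container has this name
        print("Could not inspect spark_master:", exc)
    else:
        if ip:
            print(build_master_url(ip))
        else:
            print("spark_master has no IP address (is it running?)")
```

If the printed URL differs from `spark://172.17.0.2:7077`, use the printed one in the worker command.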


# Connecting with VSCode and Writing & running code

## Connecting with VSCode

In this step, open VSCode and follow its default prompts to install the required extensions, such as the Remote series of extensions and the Python extension.

## Writing & running code

Use VSCode's remote connection to open the master container and create a .py file under the /opt/spark/work-dir path, where you can write Python code directly. Then run the command below on the master node to submit the file to Spark.

In [None]:

/opt/spark/bin/spark-submit --master spark://172.17.0.2:7077 /opt/spark/work-dir/test.py
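
If the submission needs to be launched from a Python script inside the master container rather than typed by hand, the same command can be assembled as an argument list. This is only a sketch: `submit_command` is a hypothetical helper, and its defaults are the paths and master URL used above.

```python
import shlex


def submit_command(script, master="spark://172.17.0.2:7077",
                   spark_submit="/opt/spark/bin/spark-submit"):
    """Return the spark-submit invocation as a list of arguments."""
    return [spark_submit, "--master", master, script]


cmd = submit_command("/opt/spark/work-dir/test.py")
# shlex.join quotes arguments where needed, giving a copy-pasteable command line
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would then launch the job.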

Two sample test .py files are given below.

In [None]:
# Spark call test
import random
import string

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Spark Test") \
    .getOrCreate()

# Create a sample dataset of 1,000,000 (name, age) rows

random.seed(1)  # Setting the random seed

names = [''.join(random.choices(string.ascii_uppercase, k=5)) for _ in range(1000000)]
ages = [random.randint(20, 60) for _ in range(1000000)]

data = list(zip(names, ages))

df = spark.createDataFrame(data, ["Name", "Age"])

# Printing dataset contents
df.show()

# Counting people older than 30
count = df.filter(df.Age > 30).count()
print("Number of people with age greater than 30:", count)

# Close SparkSession
spark.stop()
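
Because the random seed is fixed, the count printed by the Spark job can be cross-checked without a cluster. The sketch below regenerates the same lists in plain Python and counts the ages above 30. `expected_count` is a hypothetical helper, and the comparison assumes the lists are generated in the same order as in the test file (names first, then ages) and under the same Python version.

```python
import random
import string


def expected_count(n=1_000_000, seed=1, threshold=30):
    """Count generated ages above `threshold`, mirroring the Spark test data."""
    random.seed(seed)
    # Names are generated first so the random stream matches the test file.
    _names = [''.join(random.choices(string.ascii_uppercase, k=5)) for _ in range(n)]
    ages = [random.randint(20, 60) for _ in range(n)]
    return sum(age > threshold for age in ages)


if __name__ == "__main__":
    count = expected_count()
    # Ages are uniform on 20..60, so roughly 30/41 of them should exceed 30.
    print("Expected count:", count, "fraction:", count / 1_000_000)
```

If the number printed by the Spark job differs from this value, the DataFrame was probably not built from the same data.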

In [None]:
# Online connection test
import requests

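def get_data_from_api(api_url):
    """Return the decoded JSON body, or None if the request fails."""
    try:
        response = requests.get(api_url, timeout=10)
    except requests.RequestException:
        # Connection errors and timeouts count as a failed test
        return None

    if response.status_code == 200:
        return response.json()
    else:
        return None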

api_url = 'https://jsonplaceholder.typicode.com/posts'  # Here's a sample API
data = get_data_from_api(api_url)

if data is not None:
    print('Cool')
else:
    print('Failed to get data from API')

Now you can view your output in the results bar.

# Restart the cluster

After restarting the cluster, the nodes must be reconnected.

Start the node containers in Docker, then run the following commands in cmd as administrator.


In [None]:
# Master:
docker exec -it spark_master bash
cd ..
./bin/spark-class org.apache.spark.deploy.master.Master

# Worker:
docker exec -it spark_worker bash
cd ..
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://172.17.0.2:7077
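
After a restart it can help to confirm that the master is accepting connections before starting the worker. Below is a minimal sketch using only the standard library; `wait_for_port` is a hypothetical helper. From the host, the port published earlier (8080 on localhost, the master web UI) can be checked. From inside a container on the same Docker network, port 7077 at the master's address could be checked instead.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not reachable yet (refused, unreachable, or timed out)
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    # 8080 is the master web UI port published with -p 8080:8080
    if wait_for_port("localhost", 8080, timeout=3.0):
        print("Master web UI is reachable")
    else:
        print("Master web UI did not come up in time")
```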

Then open VSCode, connect to the cluster, and use the spark-submit command from the previous section to run the .py file.