<a href="https://colab.research.google.com/github/endophenotype/Spark/blob/main/Spark_%D0%BA%D0%BE%D0%BC%D0%B0%D0%BD%D0%B4%D1%8B_%D0%B4%D0%BB%D1%8F_%D0%B7%D0%B0%D0%B3%D1%80%D1%83%D0%B7%D0%BA%D0%B8_%D1%84%D0%B0%D0%B9%D0%BB%D0%BE%D0%B2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
!tar xf spark-3.1.1-bin-hadoop3.2.tgz
!pip install -q findspark

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"
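
A wrong path in either variable tends to surface later as a less obvious error when the session starts. The sketch below is a quick sanity check, not part of the original notebook: `missing_env_dirs` is a hypothetical helper, and it only verifies that each variable is set and points to an existing directory.

```python
import os

def missing_env_dirs(names):
    """Return the variable names that are unset or do not point to a directory."""
    missing = []
    for name in names:
        path = os.environ.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing

# An empty list means both variables look usable.
print(missing_env_dirs(["JAVA_HOME", "SPARK_HOME"]))
```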

In [None]:
!ls

sample_data  spark-3.1.1-bin-hadoop3.2	spark-3.1.1-bin-hadoop3.2.tgz


In [None]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.conf.set("spark.sql.repl.eagerEval.enabled", True) # Property used to format output tables better
spark

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession \
 .builder \
 .appName("Python Spark SQL basic example") \
 .config("spark.some.config.option", "some-value") \
 .getOrCreate()
df = spark.read.load("/content/spark-3.1.1-bin-hadoop3.2/examples/src/main/resources/users.parquet")
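# mode("overwrite") lets the cell be re-run; the default mode fails if the path already exists
df.select("name", "favorite_color").write.mode("overwrite").save("namesAndFavColors.parquet")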
df.show()


+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+



The same load call can also read JSON files if the format is specified:

In [None]:
spark = SparkSession \
 .builder \
 .appName("Python Spark SQL basic example") \
 .config("spark.some.config.option", "some-value") \
 .getOrCreate()
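df = spark.read.load("/content/spark-3.1.1-bin-hadoop3.2/examples/src/main/resources/people.json",
                     format="json")
# mode("overwrite") lets the cell be re-run without a "path already exists" error
df.select("name", "age").write.mode("overwrite").save("namesAndAges.parquet", format="parquet")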
df.show()


+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
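
The null age for Michael comes from a record that has no `age` key at all. As a rough illustration in plain Python (the standard `json` module, not Spark's reader), the sketch below parses one JSON object per line and fills missing keys with `None`. The inline sample text mirrors the three records shown above; its exact formatting is an assumption.

```python
import json

# Sample text mirroring the three records shown above (formatting assumed).
sample = """{"name": "Michael"}
{"name": "Andy", "age": 30}
{"name": "Justin", "age": 19}"""

# One JSON object per line; a missing key becomes None, like null in df.show().
records = [json.loads(line) for line in sample.splitlines() if line.strip()]

# Columns are sorted here to match the column order in the output above.
columns = sorted({key for rec in records for key in rec})
rows = [tuple(rec.get(col) for col in columns) for rec in records]

print(columns)
for row in rows:
    print(row)
```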



It can also load CSV files; the separator, header, and schema inference are passed as options:

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession \
 .builder \
 .appName("Python Spark SQL basic example") \
 .config("spark.some.config.option", "some-value") \
 .getOrCreate()
df = spark.read.load("/content/spark-3.1.1-bin-hadoop3.2/examples/src/main/resources/people.csv",
                     format="csv", sep=";", inferSchema="true", header="true")
df.show()

+-----+---+---------+
| name|age|      job|
+-----+---+---------+
|Jorge| 30|Developer|
|  Bob| 32|Developer|
+-----+---+---------+
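
To make the three CSV options concrete, here is a plain-Python analogy using the standard `csv` module. It is not how Spark parses the file: `infer` is a hypothetical, deliberately crude stand-in for `inferSchema`, and the sample text is assumed to mirror the rows shown above.

```python
import csv
import io

# Sample text mirroring the rows shown above (assumed file contents).
sample = "name;age;job\nJorge;30;Developer\nBob;32;Developer\n"

def infer(value):
    """Very rough type inference: int if possible, otherwise the original string."""
    try:
        return int(value)
    except ValueError:
        return value

# delimiter=";" plays the role of sep=";"; DictReader takes the first line as the header.
reader = csv.DictReader(io.StringIO(sample), delimiter=";")
rows = [{key: infer(val) for key, val in row.items()} for row in reader]
print(rows)
```

Without the header option the first line would be treated as data, and without inference every column would stay a string.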



ORC data source:

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession \
 .builder \
 .appName("Python Spark SQL basic example") \
 .config("spark.some.config.option", "some-value") \
 .getOrCreate()
df = spark.read.orc("/content/spark-3.1.1-bin-hadoop3.2/examples/src/main/resources/users.orc")
# Write ORC with writer-specific options; overwrite so the cell can be re-run
(df.write.format("orc")
 .mode("overwrite")
 .option("orc.bloom.filter.columns", "favorite_color")
 .option("orc.dictionary.key.threshold", "1.0")
 .option("orc.column.encoding.direct", "name")
 .save("users_with_options.orc"))
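
The `orc.bloom.filter.columns` option above asks the writer to store a Bloom filter for `favorite_color`. As a hedged illustration of the idea only (a toy in plain Python, not the ORC implementation), a Bloom filter answers "might this value be present?" with possible false positives but no false negatives. The class name, bit count, and hashing scheme below are all assumptions made for the sketch.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value):
        # Derive several bit positions by salting a SHA-256 hash of the value.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))

bloom = BloomFilter()
bloom.add("red")
print(bloom.might_contain("red"))  # always True for values that were added
```

A reader that gets a negative answer from such a filter can skip the corresponding chunk of data without scanning it.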

Parquet data source:

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession \
 .builder \
 .appName("Python Spark SQL basic example") \
 .config("spark.some.config.option", "some-value") \
 .getOrCreate()
df = spark.read.parquet("/content/spark-3.1.1-bin-hadoop3.2/examples/src/main/resources/users.parquet")
# Write Parquet with writer-specific options; overwrite so the cell can be re-run
(df.write.format("parquet")
 .mode("overwrite")
 .option("parquet.bloom.filter.enabled#favorite_color", "true")
 .option("parquet.bloom.filter.expected.ndv#favorite_color", "1000000")
 .option("parquet.enable.dictionary", "true")
 .option("parquet.page.write-checksum.enabled", "false")
 .save("users_with_options.parquet"))


Running SQL on files directly

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession \
 .builder \
 .appName("Python Spark SQL basic example") \
 .config("spark.some.config.option", "some-value") \
 .getOrCreate()
df = spark.sql("SELECT * FROM parquet.`/content/spark-3.1.1-bin-hadoop3.2/examples/src/main/resources/users.parquet`")
df.show()


+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+
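
The query above names the format, then the path in backticks. The same pattern should work for other built-in formats such as `json`, `csv`, or `orc`. The sketch below only builds the query string: `file_query` is a hypothetical helper, and the relative path is used purely for illustration.

```python
def file_query(fmt, path, columns="*"):
    """Build a 'SELECT ... FROM format.`path`' string for spark.sql (hypothetical helper)."""
    if "`" in path:
        raise ValueError("path must not contain a backtick")
    return f"SELECT {columns} FROM {fmt}.`{path}`"

query = file_query("json", "examples/src/main/resources/people.json", "name, age")
print(query)
# With a live session this could be run as: spark.sql(query).show()
```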



In [None]:
from pyspark.sql import SparkSession
spark = SparkSession \
 .builder \
 .appName("Python Spark SQL basic example") \
 .config("spark.some.config.option", "some-value") \
 .getOrCreate()
df = spark.read.parquet("/content/spark-3.1.1-bin-hadoop3.2/examples/src/main/resources/users.parquet")
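# Save as a table split into 42 buckets by "name", with rows sorted by "name" inside each bucket.
# mode("overwrite") lets the cell be re-run when the table already exists.
df.write.bucketBy(42, "name").sortBy("name").mode("overwrite").saveAsTable("people_bucketed")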
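
Bucketing assigns each row to one of a fixed number of buckets based on a hash of the bucketing column, so equal keys always land in the same bucket. The sketch below shows the idea in plain Python. Spark uses its own hash function; CRC32 is only a stand-in here, and `bucket_for` is a hypothetical name.

```python
import zlib

def bucket_for(key, num_buckets=42):
    """Map a key to a bucket id in [0, num_buckets) with a stable hash (CRC32 stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets

names = ["Alyssa", "Ben", "Alyssa"]
buckets = [bucket_for(name) for name in names]
print(buckets)  # the two "Alyssa" entries share a bucket id
```

Because the assignment depends only on the key, two tables bucketed the same way on the same column can be matched bucket by bucket.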