# Spark Session
When creating a Spark session with SparkSession.builder, you can chain builder methods and configuration options to customize its behavior. 
Here are some commonly used builder methods and configuration keys:

appName: 
    Sets the name of the Spark application.
    Example: .appName("My Spark Application")

master: 
    Specifies the cluster manager to connect to. 
    Use "local" to run Spark locally with a single worker thread, "local[*]" to use all available cores,
    or a master URL (for example "spark://host:7077" or "yarn") to connect to a cluster.
    Example: .master("local")

config: 
    Allows you to set various configuration options for Spark. 
    These options can be specified as key-value pairs.
    Example: .config("spark.some.config.option", "some-value")

enableHiveSupport: 
    Enables Hive support in the Spark session, allowing you to use Hive's SQL dialect and Hive metastore.
    Example: .enableHiveSupport()

spark.executor.memory: 
    Sets the amount of memory to be allocated per executor. 
    It can be specified with a size suffix such as "g" for gigabytes or "m" for megabytes.
    Example: .config("spark.executor.memory", "4g")

spark.driver.memory: 
    Sets the amount of memory to be allocated for the Spark driver program.
    Example: .config("spark.driver.memory", "2g")

spark.sql.shuffle.partitions: 
    Sets the number of partitions to be used when shuffling data in Spark SQL.
    Example: .config("spark.sql.shuffle.partitions", "200")

spark.sql.catalogImplementation: 
    Sets the catalog implementation for Spark SQL. 
    It can be set to "hive" for Hive catalog or "in-memory" for an in-memory catalog.
    Example: .config("spark.sql.catalogImplementation", "hive")

spark.jars: 
    Specifies a comma-separated list of JAR files to be distributed with the Spark application.
    Example: .config("spark.jars", "path/to/jar1,path/to/jar2")

These are just a few examples of the arguments you can pass to a Spark session. 
There are many more configuration options available, depending on your specific requirements and use case. 
You can refer to the Spark documentation for a comprehensive list of configuration options and their descriptions: 
    https://spark.apache.org/docs/latest/configuration.html

In [1]:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("My Spark Application") \
    .master("local") \
    .config("spark.executor.memory", "4g") \
    .config("spark.driver.memory", "2g") \
    .getOrCreate()

# Now you can use the 'spark' object to interact with Spark

23/05/30 12:35:31 WARN Utils: Your hostname, FM-PC-LT-323 resolves to a loopback address: 127.0.1.1; using 192.168.18.19 instead (on interface wlp0s20f3)
23/05/30 12:35:31 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/05/30 12:35:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In the example above, 
appName sets the name of the Spark application to "My Spark Application". 
master is set to "local" to run Spark in local mode. The config method is used to set additional configuration options. 
In this case, we set the executor memory to 4 gigabytes (spark.executor.memory) and 
the driver memory to 2 gigabytes (spark.driver.memory).

You can add more config lines to set additional configuration options as needed.
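
When the list of options grows, the repeated .config(...) calls can be driven from a dictionary instead. The sketch below is one possible way to do that; apply_settings and RecordingBuilder are hypothetical names introduced here, and RecordingBuilder is only a stand-in so the example runs without Spark installed.

```python
from typing import Any, Mapping


def apply_settings(builder: Any, settings: Mapping[str, str]) -> Any:
    """Call builder.config(key, value) once per entry and return the final builder."""
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder


class RecordingBuilder:
    """Hypothetical stand-in for SparkSession.builder (not part of PySpark)."""

    def __init__(self):
        self.options = {}

    def config(self, key, value):
        self.options[key] = value
        return self  # returning self is what makes the calls chainable


settings = {
    "spark.executor.memory": "4g",
    "spark.driver.memory": "2g",
    "spark.sql.shuffle.partitions": "200",
}

builder = apply_settings(RecordingBuilder(), settings)
print(builder.options)
```

With PySpark available, the same helper should work on the real builder, for example apply_settings(SparkSession.builder.appName("My Spark Application"), settings).getOrCreate().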

Finally, the getOrCreate() method returns the Spark session that is already active in the current process, if there is one, 
and creates a new session only when none exists. 
An existing session is reused even if the app name differs, and in that case only runtime SQL configurations passed to the builder take effect.

Note that the code snippet above is for Python. 
If you are using Spark with a different programming language,
such as Scala or Java, the syntax may be slightly different, but the concepts remain the same.
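
The reuse behaviour of getOrCreate() can be pictured with a small toy model. This is not PySpark's implementation: ToySession and get_or_create are hypothetical names, and the toy applies every option to the reused session, whereas the warning printed later in this notebook says that real Spark only lets runtime SQL configurations take effect.

```python
class ToySession:
    """Toy model of a process-wide session; not PySpark's implementation."""

    _active = None  # the single shared session, if one has been created

    def __init__(self, app_name, conf):
        self.app_name = app_name
        self.conf = dict(conf)

    @classmethod
    def get_or_create(cls, app_name, conf=None):
        conf = conf or {}
        if cls._active is None:
            # First call: nothing to reuse, so build the session.
            cls._active = cls(app_name, conf)
        else:
            # Later calls: reuse the existing session. The new app name is
            # ignored and the requested options are layered on top.
            cls._active.conf.update(conf)
        return cls._active


first = ToySession.get_or_create("My Spark Application", {"spark.driver.memory": "2g"})
second = ToySession.get_or_create("Basic Structured Operations",
                                  {"spark.sql.shuffle.partitions": "5"})

print(first is second)   # True
print(second.app_name)   # My Spark Application
```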

In [2]:
# Create a list of data
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]

# Create a DataFrame from the data
df = spark.createDataFrame(data, ["Name", "Age"])

# Show the DataFrame
df.take(2)

[Row(Name='Alice', Age=25), Row(Name='Bob', Age=30)]

In [3]:
log_df = spark.read.format("csv").option("header", "true").load("data/Log.csv")
log_df.take(2)

[Row(Id='1', Correlationid='66641e13-d19f-4ce5-aafd-9d5d7befa557', Operationname='Delete SQL database', Status='Succeeded', Eventcategory='Administrative', Level='Informational', Time='2021-06-15T04:44:38.223Z', Subscription='20c6eec9-2d80-4700-b0f6-4fde579a8783', Eventinitiatedby='Microsoft Azure Synapse Resource Provider', Resourcetype='Microsoft.Sql/servers/databases', Resourcegroup='synapseworkspace-managedrg-bd2eb25e-aba7-4f43-a25e-d8757194930d'),
 Row(Id='2', Correlationid='66641e13-d19f-4ce5-aafd-9d5d7befa557', Operationname='Delete SQL database', Status='Started', Eventcategory='Administrative', Level='Informational', Time='2021-06-15T04:44:21.547Z', Subscription='20c6eec9-2d80-4700-b0f6-4fde579a8783', Eventinitiatedby='Microsoft Azure Synapse Resource Provider', Resourcetype='Microsoft.Sql/servers/databases', Resourcegroup='synapseworkspace-managedrg-bd2eb25e-aba7-4f43-a25e-d8757194930d')]

In [4]:
log_df.take(5)

[Row(Id='1', Correlationid='66641e13-d19f-4ce5-aafd-9d5d7befa557', Operationname='Delete SQL database', Status='Succeeded', Eventcategory='Administrative', Level='Informational', Time='2021-06-15T04:44:38.223Z', Subscription='20c6eec9-2d80-4700-b0f6-4fde579a8783', Eventinitiatedby='Microsoft Azure Synapse Resource Provider', Resourcetype='Microsoft.Sql/servers/databases', Resourcegroup='synapseworkspace-managedrg-bd2eb25e-aba7-4f43-a25e-d8757194930d'),
 Row(Id='2', Correlationid='66641e13-d19f-4ce5-aafd-9d5d7befa557', Operationname='Delete SQL database', Status='Started', Eventcategory='Administrative', Level='Informational', Time='2021-06-15T04:44:21.547Z', Subscription='20c6eec9-2d80-4700-b0f6-4fde579a8783', Eventinitiatedby='Microsoft Azure Synapse Resource Provider', Resourcetype='Microsoft.Sql/servers/databases', Resourcegroup='synapseworkspace-managedrg-bd2eb25e-aba7-4f43-a25e-d8757194930d'),
 Row(Id='3', Correlationid='66641e13-d19f-4ce5-aafd-9d5d7befa557', Operationname='Delete

In [5]:
log_df = spark.read\
    .option("inferSchema", "true")\
    .option("header", "true")\
    .csv("data/Log.csv")
log_df.take(5)

[Row(Id=1, Correlationid='66641e13-d19f-4ce5-aafd-9d5d7befa557', Operationname='Delete SQL database', Status='Succeeded', Eventcategory='Administrative', Level='Informational', Time=datetime.datetime(2021, 6, 15, 10, 29, 38, 223000), Subscription='20c6eec9-2d80-4700-b0f6-4fde579a8783', Eventinitiatedby='Microsoft Azure Synapse Resource Provider', Resourcetype='Microsoft.Sql/servers/databases', Resourcegroup='synapseworkspace-managedrg-bd2eb25e-aba7-4f43-a25e-d8757194930d'),
 Row(Id=2, Correlationid='66641e13-d19f-4ce5-aafd-9d5d7befa557', Operationname='Delete SQL database', Status='Started', Eventcategory='Administrative', Level='Informational', Time=datetime.datetime(2021, 6, 15, 10, 29, 21, 547000), Subscription='20c6eec9-2d80-4700-b0f6-4fde579a8783', Eventinitiatedby='Microsoft Azure Synapse Resource Provider', Resourcetype='Microsoft.Sql/servers/databases', Resourcegroup='synapseworkspace-managedrg-bd2eb25e-aba7-4f43-a25e-d8757194930d'),
 Row(Id=3, Correlationid='66641e13-d19f-4ce5
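
With inferSchema enabled, Id comes back as an integer and Time as a timestamp, whereas the earlier read returned every column as a string. The pure-Python sketch below gives a rough idea of what such inference involves: try progressively looser types until every value in a column parses. It is a toy under stated assumptions, not Spark's algorithm; infer_type is a hypothetical helper, and the sample text reuses three columns from the first two rows shown above.

```python
import csv
import io
from datetime import datetime


def infer_type(values):
    """Return the narrowest of int, timestamp, or string that fits every value."""
    def all_parse(parser):
        try:
            for value in values:
                parser(value)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "int"
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%dT%H:%M:%S.%fZ")):
        return "timestamp"
    return "string"


# Three columns in the shape of the first two rows of data/Log.csv.
sample = io.StringIO(
    "Id,Status,Time\n"
    "1,Succeeded,2021-06-15T04:44:38.223Z\n"
    "2,Started,2021-06-15T04:44:21.547Z\n"
)
rows = list(csv.DictReader(sample))
schema = {name: infer_type([row[name] for row in rows]) for name in rows[0]}
print(schema)  # {'Id': 'int', 'Status': 'string', 'Time': 'timestamp'}
```

Inference has to look at the data before the schema is known, which is why it costs an extra pass over the file compared with supplying a schema up front.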

In [6]:
log_df.sort("Correlationid").explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [Correlationid#83 ASC NULLS FIRST], true, 0
   +- Exchange rangepartitioning(Correlationid#83 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [plan_id=72]
      +- FileScan csv [Id#82,Correlationid#83,Operationname#84,Status#85,Eventcategory#86,Level#87,Time#88,Subscription#89,Eventinitiatedby#90,Resourcetype#91,Resourcegroup#92] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/home/fm-pc-lt-323/FuseM/pySpark/PySpark Certification/data/Log.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<Id:int,Correlationid:string,Operationname:string,Status:string,Eventcategory:string,Level:...




In [7]:
spark.conf.set("spark.sql.shuffle.partitions", "5")
log_df.sort("Correlationid").take(3)

[Row(Id=39, Correlationid='02c57e3c-6a26-4e7c-a0dd-523bae7a210d', Operationname='Create Deployment', Status='Started', Eventcategory='Administrative', Level='Informational', Time=datetime.datetime(2021, 6, 14, 19, 28, 31, 168000), Subscription='20c6eec9-2d80-4700-b0f6-4fde579a8783', Eventinitiatedby='techsup1000@gmail.com', Resourcetype='Microsoft.Resources/deployments', Resourcegroup='new-grp'),
 Row(Id=41, Correlationid='02c57e3c-6a26-4e7c-a0dd-523bae7a210d', Operationname='Create or Update Dataset', Status='Started', Eventcategory='Administrative', Level='Informational', Time=datetime.datetime(2021, 6, 14, 19, 28, 36, 597000), Subscription='20c6eec9-2d80-4700-b0f6-4fde579a8783', Eventinitiatedby='techsup1000@gmail.com', Resourcetype='Microsoft.DataFactory/factories/datasets', Resourcegroup='new-grp'),
 Row(Id=40, Correlationid='02c57e3c-6a26-4e7c-a0dd-523bae7a210d', Operationname='Create Deployment', Status='Accepted', Eventcategory='Administrative', Level='Informational', Time=date

In [8]:
log_df.createOrReplaceTempView("log_df_t")

In [9]:
sqlWay = spark.sql("""
SELECT Resourcetype, count(Resourcetype)
FROM log_df_t
GROUP BY Resourcetype
""")

In [10]:
dataFrameWay = log_df.groupBy("Resourcetype").count()

In [11]:
sqlWay.explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[Resourcetype#91], functions=[count(Resourcetype#91)])
   +- Exchange hashpartitioning(Resourcetype#91, 5), ENSURE_REQUIREMENTS, [plan_id=94]
      +- HashAggregate(keys=[Resourcetype#91], functions=[partial_count(Resourcetype#91)])
         +- FileScan csv [Resourcetype#91] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/home/fm-pc-lt-323/FuseM/pySpark/PySpark Certification/data/Log.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<Resourcetype:string>




In [12]:
dataFrameWay.explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[Resourcetype#91], functions=[count(1)])
   +- Exchange hashpartitioning(Resourcetype#91, 5), ENSURE_REQUIREMENTS, [plan_id=107]
      +- HashAggregate(keys=[Resourcetype#91], functions=[partial_count(1)])
         +- FileScan csv [Resourcetype#91] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/home/fm-pc-lt-323/FuseM/pySpark/PySpark Certification/data/Log.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<Resourcetype:string>
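
Both plans have the same shape: a partial HashAggregate inside each partition, an Exchange that brings equal keys together, and a final HashAggregate that merges the partial results. The only difference is count(Resourcetype) in the SQL version versus count(1) in the DataFrame version. The sketch below models that in plain Python. It is a simplified illustration, not Spark's code: partial_counts and final_counts are hypothetical helpers, the partitions are made-up sample data, and the Exchange step is collapsed into a single merge.

```python
from collections import Counter


def partial_counts(partition, count_nulls):
    """Per-partition pre-aggregation, like the inner HashAggregate in the plan."""
    counts = Counter()
    for value in partition:
        # count(1) counts every row; count(column) skips rows where it is NULL.
        counts[value] += 1 if (count_nulls or value is not None) else 0
    return counts


def final_counts(partials):
    """Merge the partial results per key, like the outer HashAggregate."""
    total = Counter()
    for partial in partials:
        for key, n in partial.items():
            total[key] += n
    return dict(total)


# Made-up partitions; None plays the role of a NULL Resourcetype.
partitions = [
    ["Microsoft.Sql/servers/databases", "Microsoft.Resources/deployments", None],
    ["Microsoft.Sql/servers/databases", None],
]

count_column = final_counts(partial_counts(p, count_nulls=False) for p in partitions)
count_rows = final_counts(partial_counts(p, count_nulls=True) for p in partitions)

print(count_column)  # the None group has count 0, like count(Resourcetype)
print(count_rows)    # the None group has count 2, like count(1)
```

The two queries therefore agree unless the grouping column contains NULLs.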




In [13]:
log_df.printSchema()

root
 |-- Id: integer (nullable = true)
 |-- Correlationid: string (nullable = true)
 |-- Operationname: string (nullable = true)
 |-- Status: string (nullable = true)
 |-- Eventcategory: string (nullable = true)
 |-- Level: string (nullable = true)
 |-- Time: timestamp (nullable = true)
 |-- Subscription: string (nullable = true)
 |-- Eventinitiatedby: string (nullable = true)
 |-- Resourcetype: string (nullable = true)
 |-- Resourcegroup: string (nullable = true)



In [14]:
spark.range(6).collect()

[Row(id=0), Row(id=1), Row(id=2), Row(id=3), Row(id=4), Row(id=5)]

In [15]:
from pyspark.sql import Row
blog_row = Row('1','babu','23')
blog_row[1]

'babu'

In [16]:
rows = [Row('2','ram','23'),Row('4','raja','32')]
author_df = spark.createDataFrame(rows,['id','name','roll'])
author_df.show()

+---+----+----+
| id|name|roll|
+---+----+----+
|  2| ram|  23|
|  4|raja|  32|
+---+----+----+



In [17]:
spark.read.format("csv").load("data/Log.csv").schema

StructType([StructField('_c0', StringType(), True), StructField('_c1', StringType(), True), StructField('_c2', StringType(), True), StructField('_c3', StringType(), True), StructField('_c4', StringType(), True), StructField('_c5', StringType(), True), StructField('_c6', StringType(), True), StructField('_c7', StringType(), True), StructField('_c8', StringType(), True), StructField('_c9', StringType(), True), StructField('_c10', StringType(), True)])

In [18]:
# Import SparkSession
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("Basic Structured Operations").getOrCreate()

23/05/30 12:35:37 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [19]:
random_data_df=spark.read.format('csv').option("header","true").load("data/random_data.csv")
random_data_df.show(5)

+---------------+--------------+--------------------+--------------------+---------+-------------+-----------+----+--------------------+-----------+--------+------------+
|           name|         phone|               email|             address|postalZip|       region|    country|list|                text|numberrange|currency|alphanumeric|
+---------------+--------------+--------------------+--------------------+---------+-------------+-----------+----+--------------------+-----------+--------+------------+
|  Aurelia Combs|(818) 147-3806|purus.gravida@icl...|951-7278 Risus. Road|    62744|    Innlandet|    Ukraine| 100|lorem, eget molli...|          1|  $40.00| BJO33IPL2AV|
|   Cairo Church|1-566-216-0485|velit.aliquam@pro...|   2397 Lacinia. Rd.|   741616|      Cartago|    Belgium| 100|Donec dignissim m...|          6|  $36.93| EDD86ZGW5PX|
|Halee Christian|1-756-649-5978|orci.quis@protonm...|  4158 Lobortis. Av.| YV70 6RE|Northern Cape|     Poland| 100|eget lacus. Mauri...|         

In [20]:
random_data_df.printSchema()

root
 |-- name: string (nullable = true)
 |-- phone: string (nullable = true)
 |-- email: string (nullable = true)
 |-- address: string (nullable = true)
 |-- postalZip: string (nullable = true)
 |-- region: string (nullable = true)
 |-- country: string (nullable = true)
 |-- list: string (nullable = true)
 |-- text: string (nullable = true)
 |-- numberrange: string (nullable = true)
 |-- currency: string (nullable = true)
 |-- alphanumeric: string (nullable = true)



In [21]:
from pyspark.sql.types import StructField, StructType, StringType

# Caution: this schema omits the "address" column that sits between "email" and
# "postalZip" in the file, so every value after "email" lands one column to the left.
# To fix it, add StructField("address", StringType(), True) after "email".
myManualSchema = StructType([
    StructField("name", StringType(), True),
    StructField("phone", StringType(), True),
    StructField("email", StringType(), True),
    StructField("postalZip", StringType(), True),
    StructField("region", StringType(), True),
    StructField("country", StringType(), True),
    StructField("list", StringType(), True),
    StructField("text", StringType(), True),
    StructField("numberrange", StringType(), True),
    StructField("currency", StringType(), True),
    StructField("alphanumeric", StringType(), True),
    ])

# No header option is set, so the header line is read as the first data row;
# add .option("header", "true") to skip it.
random_df = spark.read.format("csv").schema(myManualSchema)\
.load("data/random_data.csv")

In [23]:
random_df.show(5)

+---------------+--------------+--------------------+--------------------+---------+-------------+-----------+----+--------------------+-----------+------------+
|           name|         phone|               email|           postalZip|   region|      country|       list|text|         numberrange|   currency|alphanumeric|
+---------------+--------------+--------------------+--------------------+---------+-------------+-----------+----+--------------------+-----------+------------+
|           name|         phone|               email|             address|postalZip|       region|    country|list|                text|numberrange|    currency|
|  Aurelia Combs|(818) 147-3806|purus.gravida@icl...|951-7278 Risus. Road|    62744|    Innlandet|    Ukraine| 100|lorem, eget molli...|          1|      $40.00|
|   Cairo Church|1-566-216-0485|velit.aliquam@pro...|   2397 Lacinia. Rd.|   741616|      Cartago|    Belgium| 100|Donec dignissim m...|          6|      $36.93|
|Halee Christian|1-756-649-5

In [42]:
random_data_df.withColumnRenamed("name", "full_name").show(5)

+---------------+--------------+--------------------+--------------------+---------+-------------+-----------+----+--------------------+-----------+--------+------------+
|      full_name|         phone|               email|             address|postalZip|       region|    country|list|                text|numberrange|currency|alphanumeric|
+---------------+--------------+--------------------+--------------------+---------+-------------+-----------+----+--------------------+-----------+--------+------------+
|  Aurelia Combs|(818) 147-3806|purus.gravida@icl...|951-7278 Risus. Road|    62744|    Innlandet|    Ukraine| 100|lorem, eget molli...|          1|  $40.00| BJO33IPL2AV|
|   Cairo Church|1-566-216-0485|velit.aliquam@pro...|   2397 Lacinia. Rd.|   741616|      Cartago|    Belgium| 100|Donec dignissim m...|          6|  $36.93| EDD86ZGW5PX|
|Halee Christian|1-756-649-5978|orci.quis@protonm...|  4158 Lobortis. Av.| YV70 6RE|Northern Cape|     Poland| 100|eget lacus. Mauri...|         

In [41]:
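# Import only what is needed; a wildcard import would shadow Python built-ins such as sum and max
from pyspark.sql.functions import col, expr, split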
random_data_df\
    .withColumn('first_name', split(random_data_df['name'], ' ').getItem(0))\
    .withColumn('last_name', split(random_data_df['name'], ' ').getItem(1))\
    .show(5)

+---------------+--------------+--------------------+--------------------+---------+-------------+-----------+----+--------------------+-----------+--------+------------+----------+---------+
|           name|         phone|               email|             address|postalZip|       region|    country|list|                text|numberrange|currency|alphanumeric|first_name|last_name|
+---------------+--------------+--------------------+--------------------+---------+-------------+-----------+----+--------------------+-----------+--------+------------+----------+---------+
|  Aurelia Combs|(818) 147-3806|purus.gravida@icl...|951-7278 Risus. Road|    62744|    Innlandet|    Ukraine| 100|lorem, eget molli...|          1|  $40.00| BJO33IPL2AV|   Aurelia|    Combs|
|   Cairo Church|1-566-216-0485|velit.aliquam@pro...|   2397 Lacinia. Rd.|   741616|      Cartago|    Belgium| 100|Donec dignissim m...|          6|  $36.93| EDD86ZGW5PX|     Cairo|   Church|
|Halee Christian|1-756-649-5978|orci.qui

In [46]:
random_data_df.drop("region","list","text").columns

['name',
 'phone',
 'email',
 'address',
 'postalZip',
 'country',
 'numberrange',
 'currency',
 'alphanumeric']

In [47]:
random_data_df.withColumn("list", col("list").cast("Int"))

DataFrame[name: string, phone: string, email: string, address: string, postalZip: string, region: string, country: string, list: int, text: string, numberrange: string, currency: string, alphanumeric: string]

In [53]:
random_data_df.filter(col("country") == "Ukraine").show(2)

+-----------------+--------------+--------------------+--------------------+---------+---------------+-------+----+--------------------+-----------+--------+------------+
|             name|         phone|               email|             address|postalZip|         region|country|list|                text|numberrange|currency|alphanumeric|
+-----------------+--------------+--------------------+--------------------+---------+---------------+-------+----+--------------------+-----------+--------+------------+
|    Aurelia Combs|(818) 147-3806|purus.gravida@icl...|951-7278 Risus. Road|    62744|      Innlandet|Ukraine| 100|lorem, eget molli...|          1|  $40.00| BJO33IPL2AV|
|Valentine O'Neill|(364) 583-4329|pharetra.sed@hotm...|Ap #490-2719 Dict...|  6955 IU|New South Wales|Ukraine| 100|nibh dolor, nonum...|          2|  $61.17| USS48ETS7HG|
+-----------------+--------------+--------------------+--------------------+---------+---------------+-------+----+--------------------+---------

In [55]:
random_data_df.where("list ==100").show(2)

+-------------+--------------+--------------------+--------------------+---------+---------+-------+----+--------------------+-----------+--------+------------+
|         name|         phone|               email|             address|postalZip|   region|country|list|                text|numberrange|currency|alphanumeric|
+-------------+--------------+--------------------+--------------------+---------+---------+-------+----+--------------------+-----------+--------+------------+
|Aurelia Combs|(818) 147-3806|purus.gravida@icl...|951-7278 Risus. Road|    62744|Innlandet|Ukraine| 100|lorem, eget molli...|          1|  $40.00| BJO33IPL2AV|
| Cairo Church|1-566-216-0485|velit.aliquam@pro...|   2397 Lacinia. Rd.|   741616|  Cartago|Belgium| 100|Donec dignissim m...|          6|  $36.93| EDD86ZGW5PX|
+-------------+--------------+--------------------+--------------------+---------+---------+-------+----+--------------------+-----------+--------+------------+
only showing top 2 rows



In [72]:

# Chained filter() calls are combined with AND; no row matches both conditions here
random_data_df\
    .filter(col("country") == "Ukraine")\
    .filter(col("postalZip")=="741616")\
    .show(2)

+----+-----+-----+-------+---------+------+-------+----+----+-----------+--------+------------+
|name|phone|email|address|postalZip|region|country|list|text|numberrange|currency|alphanumeric|
+----+-----+-----+-------+---------+------+-------+----+----+-----------+--------+------------+
+----+-----+-----+-------+---------+------+-------+----+----+-----------+--------+------------+



In [76]:

random_data_df\
    .where((col("country") == "Ukraine") & (col("list")==100))\
    .show(2)

random_data_df\
    .where(col("country") == "Ukraine")\
    .where(col("list")==100)\
    .show(2)

+-----------------+--------------+--------------------+--------------------+---------+---------------+-------+----+--------------------+-----------+--------+------------+
|             name|         phone|               email|             address|postalZip|         region|country|list|                text|numberrange|currency|alphanumeric|
+-----------------+--------------+--------------------+--------------------+---------+---------------+-------+----+--------------------+-----------+--------+------------+
|    Aurelia Combs|(818) 147-3806|purus.gravida@icl...|951-7278 Risus. Road|    62744|      Innlandet|Ukraine| 100|lorem, eget molli...|          1|  $40.00| BJO33IPL2AV|
|Valentine O'Neill|(364) 583-4329|pharetra.sed@hotm...|Ap #490-2719 Dict...|  6955 IU|New South Wales|Ukraine| 100|nibh dolor, nonum...|          2|  $61.17| USS48ETS7HG|
+-----------------+--------------+--------------------+--------------------+---------+---------------+-------+----+--------------------+---------

In [77]:
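# Both conditions in a single filter() call, combined with &
random_data_df\
    .filter((col("country") == "Ukraine") & (col("list")==100))\
    .show(2)
# Equivalent and easier to read (the form preferred here): chained filter() calls are ANDed together
random_data_df\
    .filter(col("country") == "Ukraine")\
    .filter(col("list")==100)\
    .show(2)
#---------------------------------------------------------------
# where() is an alias for filter(), so the same two forms apply
random_data_df\
    .where((col("country") == "Ukraine") & (col("list")==100))\
    .show(2)
# Preferred form with where()
random_data_df\
    .where(col("country") == "Ukraine")\
    .where(col("list")==100)\
    .show(2)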

+-----------------+--------------+--------------------+--------------------+---------+---------------+-------+----+--------------------+-----------+--------+------------+
|             name|         phone|               email|             address|postalZip|         region|country|list|                text|numberrange|currency|alphanumeric|
+-----------------+--------------+--------------------+--------------------+---------+---------------+-------+----+--------------------+-----------+--------+------------+
|    Aurelia Combs|(818) 147-3806|purus.gravida@icl...|951-7278 Risus. Road|    62744|      Innlandet|Ukraine| 100|lorem, eget molli...|          1|  $40.00| BJO33IPL2AV|
|Valentine O'Neill|(364) 583-4329|pharetra.sed@hotm...|Ap #490-2719 Dict...|  6955 IU|New South Wales|Ukraine| 100|nibh dolor, nonum...|          2|  $61.17| USS48ETS7HG|
+-----------------+--------------+--------------------+--------------------+---------+---------------+-------+----+--------------------+---------

In [85]:
random_data_df.distinct().count()

100

In [87]:
# Only the value of the last expression in a cell is displayed
random_data_df.distinct().count()
random_data_df.select("name","postalZip","list").distinct().count()

100

In [88]:
random_data_df.sort("name").show(5)
random_data_df.orderBy("name", "email").show(5)
random_data_df.orderBy(col("name"), col("email")).show(5)

+---------------+--------------+--------------------+--------------------+---------+----------------+--------------+----+--------------------+-----------+--------+------------+
|           name|         phone|               email|             address|postalZip|          region|       country|list|                text|numberrange|currency|alphanumeric|
+---------------+--------------+--------------------+--------------------+---------+----------------+--------------+----+--------------------+-----------+--------+------------+
|      Adam Rush|1-608-833-8760|erat.neque@yahoo....|        7483 Non Av.|     2022|          Marche|         Chile| 100|gravida molestie ...|          9|  $13.22| SZP14JQK8UK|
|   Adele Barton|(243) 622-4148|      id@hotmail.net|P.O. Box 354, 745...|     2386| West-Vlaanderen|United Kingdom| 100|libero. Integer i...|          6|   $1.67| KDF56DBK5RX|
|   Adena Daniel|(851) 463-8228|vestibulum.mauris...|746-7651 Dapibus Av.|   403695|      Vorarlberg|       Germany

In [91]:
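from pyspark.sql.functions import asc, col, desc, expr
# Caution: expr("name desc") is parsed as the column "name" aliased to "desc", so this
# sorts ascending (as the output shows); use col("name").desc() or desc("name") instead.
random_data_df.orderBy(expr("name desc")).show(2)
random_data_df.orderBy(col("name").desc(), col("email").asc()).show(2)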

+------------+--------------+--------------------+--------------------+---------+---------------+--------------+----+--------------------+-----------+--------+------------+
|        name|         phone|               email|             address|postalZip|         region|       country|list|                text|numberrange|currency|alphanumeric|
+------------+--------------+--------------------+--------------------+---------+---------------+--------------+----+--------------------+-----------+--------+------------+
|   Adam Rush|1-608-833-8760|erat.neque@yahoo....|        7483 Non Av.|     2022|         Marche|         Chile| 100|gravida molestie ...|          9|  $13.22| SZP14JQK8UK|
|Adele Barton|(243) 622-4148|      id@hotmail.net|P.O. Box 354, 745...|     2386|West-Vlaanderen|United Kingdom| 100|libero. Integer i...|          6|   $1.67| KDF56DBK5RX|
+------------+--------------+--------------------+--------------------+---------+---------------+--------------+----+------------------

In [93]:
random_data_df.limit(5).show()

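# Note: expr("name desc") treats "desc" as an alias, so this sorts ascending;
# use col("name").desc() to sort descending.
random_data_df.orderBy(expr("name desc")).limit(6).show()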

+---------------+--------------+--------------------+--------------------+---------+-------------+-----------+----+--------------------+-----------+--------+------------+
|           name|         phone|               email|             address|postalZip|       region|    country|list|                text|numberrange|currency|alphanumeric|
+---------------+--------------+--------------------+--------------------+---------+-------------+-----------+----+--------------------+-----------+--------+------------+
|  Aurelia Combs|(818) 147-3806|purus.gravida@icl...|951-7278 Risus. Road|    62744|    Innlandet|    Ukraine| 100|lorem, eget molli...|          1|  $40.00| BJO33IPL2AV|
|   Cairo Church|1-566-216-0485|velit.aliquam@pro...|   2397 Lacinia. Rd.|   741616|      Cartago|    Belgium| 100|Donec dignissim m...|          6|  $36.93| EDD86ZGW5PX|
|Halee Christian|1-756-649-5978|orci.quis@protonm...|  4158 Lobortis. Av.| YV70 6RE|Northern Cape|     Poland| 100|eget lacus. Mauri...|         

In [108]:
random_data_df.select(expr("name AS name_and_surname")).show(2)

random_data_df.select(expr("name as full_name").alias("my_full_name")).show(2)

random_data_df.selectExpr("email as my_email", "postalZip as Postal_code").show(2)

random_data_df.selectExpr("avg(postalZip) as Average_postal_code").show()

+----------------+
|name_and_surname|
+----------------+
|   Aurelia Combs|
|    Cairo Church|
+----------------+
only showing top 2 rows

+-------------+
| my_full_name|
+-------------+
|Aurelia Combs|
| Cairo Church|
+-------------+
only showing top 2 rows

+--------------------+-----------+
|            my_email|Postal_code|
+--------------------+-----------+
|purus.gravida@icl...|      62744|
|velit.aliquam@pro...|     741616|
+--------------------+-----------+
only showing top 2 rows

+-------------------+
|Average_postal_code|
+-------------------+
| 194690.55263157896|
+-------------------+



In [110]:
from pyspark.sql.functions import lit
random_data_df.select(expr("*"), lit(1).alias("One")).show(2)

+-------------+--------------+--------------------+--------------------+---------+---------+-------+----+--------------------+-----------+--------+------------+---+
|         name|         phone|               email|             address|postalZip|   region|country|list|                text|numberrange|currency|alphanumeric|One|
+-------------+--------------+--------------------+--------------------+---------+---------+-------+----+--------------------+-----------+--------+------------+---+
|Aurelia Combs|(818) 147-3806|purus.gravida@icl...|951-7278 Risus. Road|    62744|Innlandet|Ukraine| 100|lorem, eget molli...|          1|  $40.00| BJO33IPL2AV|  1|
| Cairo Church|1-566-216-0485|velit.aliquam@pro...|   2397 Lacinia. Rd.|   741616|  Cartago|Belgium| 100|Donec dignissim m...|          6|  $36.93| EDD86ZGW5PX|  1|
+-------------+--------------+--------------------+--------------------+---------+---------+-------+----+--------------------+-----------+--------+------------+---+
only showi

In [113]:
seed = 5
withReplacement = False
fraction = 0.5
random_data_df.sample(withReplacement, fraction, seed).count()

56
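
The count is 56 rather than exactly 50 because fraction is a per-row probability, not a guaranteed sample size. The sketch below models sampling without replacement as an independent coin flip per row. It is a plain-Python approximation of the idea, not Spark's sampler; bernoulli_sample is a hypothetical helper and the 100 integers stand in for the rows of random_data_df.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(100))  # stand-in for the rows of random_data_df

# The same seed always gives the same sample; different seeds give different sizes.
sizes = [len(bernoulli_sample(rows, 0.5, seed)) for seed in range(5)]
print(sizes)  # sizes cluster around 50 but need not equal 50
```

Fixing the seed makes a run repeatable, which is why the notebook passes seed explicitly.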

In [128]:
dataFrames = random_data_df.randomSplit([0.4, 0.6], seed)
dataFrames[0].show()
dataFrames[1].show()
dataFrames[0].count() > dataFrames[1].count()

+----------------+--------------+--------------------+--------------------+---------+--------------------+--------------+----+--------------------+-----------+--------+------------+
|            name|         phone|               email|             address|postalZip|              region|       country|list|                text|numberrange|currency|alphanumeric|
+----------------+--------------+--------------------+--------------------+---------+--------------------+--------------+----+--------------------+-----------+--------+------------+
|       Adam Rush|1-608-833-8760|erat.neque@yahoo....|        7483 Non Av.|     2022|              Marche|         Chile| 100|gravida molestie ...|          9|  $13.22| SZP14JQK8UK|
|    Adele Barton|(243) 622-4148|      id@hotmail.net|P.O. Box 354, 745...|     2386|     West-Vlaanderen|United Kingdom| 100|libero. Integer i...|          6|   $1.67| KDF56DBK5RX|
| Althea Davidson|(358) 674-9179|    class@yahoo.couk|Ap #116-9362 Dapi...|   523311|     

False

In [131]:
dataFrames[0].union(dataFrames[1]).show()

+----------------+--------------+--------------------+--------------------+---------+--------------------+--------------+----+--------------------+-----------+--------+------------+
|            name|         phone|               email|             address|postalZip|              region|       country|list|                text|numberrange|currency|alphanumeric|
+----------------+--------------+--------------------+--------------------+---------+--------------------+--------------+----+--------------------+-----------+--------+------------+
|       Adam Rush|1-608-833-8760|erat.neque@yahoo....|        7483 Non Av.|     2022|              Marche|         Chile| 100|gravida molestie ...|          9|  $13.22| SZP14JQK8UK|
|    Adele Barton|(243) 622-4148|      id@hotmail.net|P.O. Box 354, 745...|     2386|     West-Vlaanderen|United Kingdom| 100|libero. Integer i...|          6|   $1.67| KDF56DBK5RX|
| Althea Davidson|(358) 674-9179|    class@yahoo.couk|Ap #116-9362 Dapi...|   523311|     

In [134]:
# Create a list of data
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]

# Create a DataFrame from the data
df1 = spark.createDataFrame(data, ["Name", "Age"])

# Show only the first 2 rows of the DataFrame
df1.show(2)

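# Build two extra rows that match df1's schema (Name, Age)
from pyspark.sql import Row
schema = df1.schema
newRows = [
    Row("New Country",85),
    Row("New Country 2",851)
]
parallelizedRows = spark.sparkContext.parallelize(newRows)
newDF = spark.createDataFrame(parallelizedRows, schema)

# union appends rows by column position, so both DataFrames need the same column order
df1.union(newDF).show()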

+-----+---+
| Name|Age|
+-----+---+
|Alice| 25|
|  Bob| 30|
+-----+---+
only showing top 2 rows

+-------------+---+
|         Name|Age|
+-------------+---+
|        Alice| 25|
|          Bob| 30|
|      Charlie| 35|
|  New Country| 85|
|New Country 2|851|
+-------------+---+



In [148]:
from pyspark.sql.functions import expr
expr("(((list + 5) * 200) - 6)")

Column<'(((list + 5) * 200) - 6)'>

In [154]:
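# repartition returns a new DataFrame and does not modify random_data_df,
# so this result is discarded unless it is assigned to a variable
random_data_df.repartition(5)
random_data_df.take(2)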

[Row(name='Aurelia Combs', phone='(818) 147-3806', email='purus.gravida@icloud.com', address='951-7278 Risus. Road', postalZip='62744', region='Innlandet', country='Ukraine', list='100', text='lorem, eget mollis lectus pede et risus. Quisque libero lacus,', numberrange='1', currency='$40.00', alphanumeric='BJO33IPL2AV'),
 Row(name='Cairo Church', phone='1-566-216-0485', email='velit.aliquam@protonmail.com', address='2397 Lacinia. Rd.', postalZip='741616', region='Cartago', country='Belgium', list='100', text='Donec dignissim magna a tortor. Nunc commodo auctor velit. Aliquam', numberrange='6', currency='$36.93', alphanumeric='EDD86ZGW5PX')]

In [156]:
from pyspark.sql.functions import col

# Repartition by the values of the name column
random_data_df.repartition(col("name"))


DataFrame[name: string, phone: string, email: string, address: string, postalZip: string, region: string, country: string, list: string, text: string, numberrange: string, currency: string, alphanumeric: string]

In [164]:
# Repartition into 5 partitions keyed on the name column (the result is not assigned)
random_data_df.repartition(5, col("name"))

random_data_df.take(2)

[Row(name='Aurelia Combs', phone='(818) 147-3806', email='purus.gravida@icloud.com', address='951-7278 Risus. Road', postalZip='62744', region='Innlandet', country='Ukraine', list='100', text='lorem, eget mollis lectus pede et risus. Quisque libero lacus,', numberrange='1', currency='$40.00', alphanumeric='BJO33IPL2AV'),
 Row(name='Cairo Church', phone='1-566-216-0485', email='velit.aliquam@protonmail.com', address='2397 Lacinia. Rd.', postalZip='741616', region='Cartago', country='Belgium', list='100', text='Donec dignissim magna a tortor. Nunc commodo auctor velit. Aliquam', numberrange='6', currency='$36.93', alphanumeric='EDD86ZGW5PX')]

In [173]:
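# coalesce(3) merges the 5 partitions down to 3 without a full shuffle (the result is not assigned)
random_data_df.repartition(5, col("name")).coalesce(3)
random_data_df.take(4)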

[Row(name='Aurelia Combs', phone='(818) 147-3806', email='purus.gravida@icloud.com', address='951-7278 Risus. Road', postalZip='62744', region='Innlandet', country='Ukraine', list='100', text='lorem, eget mollis lectus pede et risus. Quisque libero lacus,', numberrange='1', currency='$40.00', alphanumeric='BJO33IPL2AV'),
 Row(name='Cairo Church', phone='1-566-216-0485', email='velit.aliquam@protonmail.com', address='2397 Lacinia. Rd.', postalZip='741616', region='Cartago', country='Belgium', list='100', text='Donec dignissim magna a tortor. Nunc commodo auctor velit. Aliquam', numberrange='6', currency='$36.93', alphanumeric='EDD86ZGW5PX'),
 Row(name='Halee Christian', phone='1-756-649-5978', email='orci.quis@protonmail.couk', address='4158 Lobortis. Av.', postalZip='YV70 6RE', region='Northern Cape', country='Poland', list='100', text='eget lacus. Mauris non dui nec urna suscipit nonummy. Fusce', numberrange='6', currency='$3.70', alphanumeric='TFB01UUD2VJ'),
 Row(name='Rhoda Shepard'

In [174]:
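collectDF = random_data_df.limit(10)
collectDF.take(5) # take works with an Integer count and returns a list of Row objects
collectDF.show() # this prints it out nicely
collectDF.show(5, False) # 5 rows, without truncating long values
collectDF.collect() # returns every row to the driver as a list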

+---------------+--------------+--------------------+--------------------+-----------+--------------------+-----------+----+--------------------+-----------+--------+------------+
|           name|         phone|               email|             address|  postalZip|              region|    country|list|                text|numberrange|currency|alphanumeric|
+---------------+--------------+--------------------+--------------------+-----------+--------------------+-----------+----+--------------------+-----------+--------+------------+
|  Aurelia Combs|(818) 147-3806|purus.gravida@icl...|951-7278 Risus. Road|      62744|           Innlandet|    Ukraine| 100|lorem, eget molli...|          1|  $40.00| BJO33IPL2AV|
|   Cairo Church|1-566-216-0485|velit.aliquam@pro...|   2397 Lacinia. Rd.|     741616|             Cartago|    Belgium| 100|Donec dignissim m...|          6|  $36.93| EDD86ZGW5PX|
|Halee Christian|1-756-649-5978|orci.quis@protonm...|  4158 Lobortis. Av.|   YV70 6RE|       Norther

[Row(name='Aurelia Combs', phone='(818) 147-3806', email='purus.gravida@icloud.com', address='951-7278 Risus. Road', postalZip='62744', region='Innlandet', country='Ukraine', list='100', text='lorem, eget mollis lectus pede et risus. Quisque libero lacus,', numberrange='1', currency='$40.00', alphanumeric='BJO33IPL2AV'),
 Row(name='Cairo Church', phone='1-566-216-0485', email='velit.aliquam@protonmail.com', address='2397 Lacinia. Rd.', postalZip='741616', region='Cartago', country='Belgium', list='100', text='Donec dignissim magna a tortor. Nunc commodo auctor velit. Aliquam', numberrange='6', currency='$36.93', alphanumeric='EDD86ZGW5PX'),
 Row(name='Halee Christian', phone='1-756-649-5978', email='orci.quis@protonmail.couk', address='4158 Lobortis. Av.', postalZip='YV70 6RE', region='Northern Cape', country='Poland', list='100', text='eget lacus. Mauris non dui nec urna suscipit nonummy. Fusce', numberrange='6', currency='$3.70', alphanumeric='TFB01UUD2VJ'),
 Row(name='Rhoda Shepard'

In [175]:
# toLocalIterator returns a generator; rows are fetched only when it is iterated
collectDF.toLocalIterator()

<generator object _local_iterator_from_socket.<locals>.PyLocalIterable.__iter__ at 0x7f3c4b3c1700>

In [176]:
collectDF = random_data_df.limit(40)
collectDF.toLocalIterator()

<generator object _local_iterator_from_socket.<locals>.PyLocalIterable.__iter__ at 0x7f3c4b3c1230>