# We can expect one or more questions on each function from the list below

In [1]:
import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T

In [2]:
spark = SparkSession.builder\
      .master("local")\
      .appName("expected_question_pyspark")\
      .getOrCreate()

23/10/31 09:40:35 WARN Utils: Your hostname, FM-PC-LT-323 resolves to a loopback address: 127.0.1.1; using 172.16.5.219 instead (on interface wlp0s20f3)
23/10/31 09:40:35 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/10/31 09:40:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


# Reading the CSV files and printing the schema of each DataFrame

In [3]:
def read_csv(path):
    """Read a CSV file with the options shared by every dataset in this notebook."""
    return spark.read.csv(
        path,
        header=True,
        inferSchema=True,
        sep=",",
        encoding="UTF-8",
        nullValue="NA",
        dateFormat="yyyy-MM-dd",
        timestampFormat="yyyy-MM-dd HH:mm:ss",
        mode="PERMISSIVE",
    )

department_df = read_csv("data/department.csv")
department_df.printSchema()
employee_df = read_csv("data/employee.csv")
employee_df.printSchema()
log_df = read_csv("data/Log.csv")
log_df.printSchema()

root
 |-- d_id: integer (nullable = true)
 |-- dept_name: string (nullable = true)

root
 |-- eid: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- post: string (nullable = true)
 |-- salary: integer (nullable = true)
 |-- did: integer (nullable = true)

root
 |-- Id: integer (nullable = true)
 |-- Correlationid: string (nullable = true)
 |-- Operationname: string (nullable = true)
 |-- Status: string (nullable = true)
 |-- Eventcategory: string (nullable = true)
 |-- Level: string (nullable = true)
 |-- Time: string (nullable = true)
 |-- Subscription: string (nullable = true)
 |-- Eventinitiatedby: string (nullable = true)
 |-- Resourcetype: string (nullable = true)
 |-- Resourcegroup: string (nullable = true)



# Actions

### In PySpark, "Actions" are operations that trigger the actual execution of transformations and return results to the driver program or display them in the console. 

The collect action retrieves all the rows of a distributed collection (a DataFrame or an RDD) and returns them to the driver program as a Python list of Row objects. Use it with caution: it brings all the data to the driver, so a large dataset can exhaust the driver's memory.

In [107]:
department_df.collect()

[Row(d_id=101, dept_name='Engineering'),
 Row(d_id=102, dept_name='IT'),
 Row(d_id=103, dept_name='Design'),
 Row(d_id=104, dept_name='Support'),
 Row(d_id=105, dept_name='Network'),
 Row(d_id=106, dept_name='Software')]

The show action prints the first n rows of a DataFrame (20 by default) in a tabular format. The truncate parameter controls whether long strings are cut off for readability; by default, strings longer than 20 characters are truncated.

In [108]:
department_df.show()

+----+-----------+
|d_id|  dept_name|
+----+-----------+
| 101|Engineering|
| 102|         IT|
| 103|     Design|
| 104|    Support|
| 105|    Network|
| 106|   Software|
+----+-----------+



The first action returns the first row of a DataFrame (or the first element of an RDD). It's useful when you want to quickly inspect the first row of your data.

In [109]:
department_df.first()

Row(d_id=101, dept_name='Engineering')

The count action counts the number of elements in a DataFrame or RDD and returns an integer representing the count.

In [110]:
department_df.count()

6

The tail action returns the last n rows of a DataFrame as a list of Row objects. Like collect, it moves data to the driver, so keep n small.

In [111]:
department_df.tail(3)

[Row(d_id=104, dept_name='Support'),
 Row(d_id=105, dept_name='Network'),
 Row(d_id=106, dept_name='Software')]

The take action returns the first n rows of a DataFrame (or elements of an RDD) as a list. It is equivalent to head(n); unlike show, it returns the rows instead of printing them.

In [112]:
department_df.take(3)

[Row(d_id=101, dept_name='Engineering'),
 Row(d_id=102, dept_name='IT'),
 Row(d_id=103, dept_name='Design')]

The head action returns the first n rows of a DataFrame as a list of Row objects. Called without an argument, head() returns a single Row, just like first.

In [None]:
department_df.head(3)

The toLocalIterator action returns an iterator over the rows of a DataFrame or RDD. It's useful when you want to process rows locally on the driver program. Rows are fetched one partition at a time, so the driver needs only as much memory as the largest partition, but iterating over a large dataset still moves all of its data to the driver.

In [113]:
iterator = department_df.toLocalIterator()
for row in iterator:
    department_id = row.d_id
    department_name = row.dept_name
    print(f"Department ID: {department_id}, Department Name: {department_name}")


Department ID: 101, Department Name: Engineering
Department ID: 102, Department Name: IT
Department ID: 103, Department Name: Design
Department ID: 104, Department Name: Support
Department ID: 105, Department Name: Network
Department ID: 106, Department Name: Software
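
The memory difference between collect and toLocalIterator can be pictured with a plain-Python model. This is only a sketch, not Spark code: it assumes a DataFrame behaves like a list of partitions, and the way the six department rows are split across three partitions here is made up for illustration.

```python
# Plain-Python model (not Spark code): a "DataFrame" is a list of partitions,
# and each partition is a list of rows. The partition layout is hypothetical.
partitions = [
    [(101, "Engineering"), (102, "IT")],
    [(103, "Design")],
    [(104, "Support"), (105, "Network"), (106, "Software")],
]

def collect(parts):
    """Materialize every row at once, like collect()."""
    return [row for part in parts for row in part]

def to_local_iterator(parts):
    """Yield rows one partition at a time, like toLocalIterator()."""
    for part in parts:
        # Only the current partition has to be held in memory.
        yield from part

rows_held_by_collect = len(collect(partitions))
rows_held_by_iterator = max(len(part) for part in partitions)
streamed = list(to_local_iterator(partitions))

print(rows_held_by_collect)   # 6
print(rows_held_by_iterator)  # 3
```

Both approaches visit the same rows in the same order; what changes is how many rows sit on the driver at any one time.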


# Typed Transformations 

### Typed Transformations: coalesce, distinct, dropDuplicates, filter, limit, orderBy, repartition, sample, select, sort, union, unionAll, where

### Showing Dataframe and Schema

In [None]:
log_df.printSchema()
log_df.show()

## repartition
### https://sparkbyexamples.com/spark/spark-repartition-vs-coalesce/

In [115]:
current_partition = log_df.rdd.getNumPartitions()
current_partition

1

In [116]:
num_partition = 6
repartition_df = log_df.repartition(num_partition)
current_partition = repartition_df.rdd.getNumPartitions()
current_partition

6

## coalesce

In [117]:
current_partition = repartition_df.rdd.getNumPartitions()
current_partition

6

In [119]:
num_partition = 3
coalesce_df = repartition_df.coalesce(num_partition)
current_partition = coalesce_df.rdd.getNumPartitions()
current_partition

3
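
The two calls differ in how rows move: repartition performs a full shuffle and can raise or lower the partition count, while coalesce merges existing partitions without a full shuffle and can only lower it. The plain-Python model below is a sketch of that idea only. The helper names are hypothetical, and merging partitions by index is a simplification of how Spark actually groups them.

```python
def repartition(parts, n):
    """Full shuffle: rows are dealt round-robin into n new partitions."""
    rows = [row for part in parts for row in part]
    return [rows[i::n] for i in range(n)]

def coalesce(parts, n):
    """No full shuffle: whole existing partitions are merged into n groups."""
    if n >= len(parts):
        return parts  # coalesce cannot increase the partition count
    merged = [[] for _ in range(n)]
    for i, part in enumerate(parts):
        merged[i % n].extend(part)
    return merged

one_partition = [list(range(1, 50))]   # 49 rows in 1 partition, like log_df
six = repartition(one_partition, 6)
three = coalesce(six, 3)
still_three = coalesce(three, 10)      # asking for more partitions changes nothing

print(len(six), len(three), len(still_three))  # 6 3 3
```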

## repartition by column name

In [129]:
current_partition = log_df.rdd.getNumPartitions()
current_partition


1

In [131]:
num_partition = 3
repartition_by_column_df = log_df.repartition(num_partition, "Operationname")
current_partition = repartition_by_column_df.rdd.getNumPartitions()
current_partition

3

In [135]:
num_partition = 4
repartition_by_column_df = log_df.repartition(num_partition, "Operationname", "Status")
current_partition = repartition_by_column_df.rdd.getNumPartitions()
current_partition

4

In [138]:
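# Use as many partitions as there are distinct Operationname values.
# Rows are assigned by hash, so some partitions may still end up empty.
unique_operations = log_df.select("Operationname").distinct().rdd.flatMap(lambda x: x).collect()
num_partition = len(unique_operations)
repartitioned_df = log_df.repartition(num_partition, "Operationname")
current_partition = repartitioned_df.rdd.getNumPartitions()
current_partition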

12
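
Repartitioning by a column places each row according to a hash of the column value, so matching the partition count to the number of distinct values does not guarantee one value per partition. The sketch below models this in plain Python. It uses `zlib.crc32` as a stand-in hash (an assumption; Spark uses its own hash function internally), and the `partition_for` helper is hypothetical.

```python
import zlib
from collections import defaultdict

def partition_for(key, num_partitions):
    # zlib.crc32 is a stand-in: Spark uses its own hash function internally.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# A few Operationname values that appear in the log data.
operations = [
    "Create Deployment",
    "Delete SQL database",
    "Delete SqlPools",
    "Create Pipeline Run",
    "Delete Deployment",
]
num_partitions = len(operations)

# Group the values by the partition their hash selects.
buckets = defaultdict(set)
for op in operations:
    buckets[partition_for(op, num_partitions)].add(op)

non_empty = len(buckets)
print(f"{non_empty} of {num_partitions} partitions are non-empty")
```

All rows with the same value always land in the same partition, but two different values may share one, leaving other partitions empty.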

## Filter 

In [86]:
filtered_df = log_df.filter(
    (F.col("Status") == "Succeeded") &
    (F.col("Operationname").isin(["Delete SQL database", "Delete SqlPools"]))
)
filtered_df.show()
filtered_df.count()

+---+--------------------+-------------------+---------+--------------+-------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| Id|       Correlationid|      Operationname|   Status| Eventcategory|        Level|                Time|        Subscription|    Eventinitiatedby|        Resourcetype|       Resourcegroup|
+---+--------------------+-------------------+---------+--------------+-------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|  1|66641e13-d19f-4ce...|Delete SQL database|Succeeded|Administrative|Informational|2021-06-15T04:44:...|20c6eec9-2d80-470...|Microsoft Azure S...|Microsoft.Sql/ser...|synapseworkspace-...|
|  4|e2958162-93d9-464...|    Delete SqlPools|Succeeded|Administrative|Informational|2021-06-15T04:44:...|20c6eec9-2d80-470...|techsup1000@gmail...|Microsoft.Synapse...|             new-grp|
+---+--------------------+-------------------

2
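
Each condition in the filter above is wrapped in parentheses for a reason: in Python, `&` and `|` bind more tightly than `==`. The effect can be seen with plain integers, without Spark. The variables `x` and `y` are hypothetical stand-ins for column values.

```python
x, y = 1, 3

# With parentheses: each comparison is evaluated first, then combined.
with_parens = (x == 1) & (y == 3)   # True & True

# Without parentheses Python parses this as the chain x == (1 & y) == 3,
# because & has higher precedence than ==.
without_parens = x == 1 & y == 3    # 1 == 1 == 3

print(with_parens)     # True
print(without_parens)  # False
```

With PySpark columns the unparenthesized form typically fails with an error rather than silently returning the wrong rows, but the cause is the same precedence rule.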

In [91]:
filtered_df = log_df.filter(
    (F.col("Status") == "Succeeded") |
    (F.col("Operationname").isin(["Delete SQL database", "Delete SqlPools"]))
)
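filtered_df.limit(16).show()
# limit(18) caps the result at 18 rows, so this count is at most 18, not the total number of matches
filtered_df.limit(18).count()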

+---+--------------------+--------------------+---------+--------------+-------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| Id|       Correlationid|       Operationname|   Status| Eventcategory|        Level|                Time|        Subscription|    Eventinitiatedby|        Resourcetype|       Resourcegroup|
+---+--------------------+--------------------+---------+--------------+-------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|  1|66641e13-d19f-4ce...| Delete SQL database|Succeeded|Administrative|Informational|2021-06-15T04:44:...|20c6eec9-2d80-470...|Microsoft Azure S...|Microsoft.Sql/ser...|synapseworkspace-...|
|  2|66641e13-d19f-4ce...| Delete SQL database|  Started|Administrative|Informational|2021-06-15T04:44:...|20c6eec9-2d80-470...|Microsoft Azure S...|Microsoft.Sql/ser...|synapseworkspace-...|
|  3|66641e13-d19f-4ce...| Delete SQL da

18

## Where 

In [139]:
where_df = log_df.where(
    (F.col("Status") == "Succeeded") &
    (F.col("Operationname").isin(["Delete SQL database", "Delete SqlPools"]))
)
where_df.show()
where_df.count()

+---+--------------------+-------------------+---------+--------------+-------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| Id|       Correlationid|      Operationname|   Status| Eventcategory|        Level|                Time|        Subscription|    Eventinitiatedby|        Resourcetype|       Resourcegroup|
+---+--------------------+-------------------+---------+--------------+-------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|  1|66641e13-d19f-4ce...|Delete SQL database|Succeeded|Administrative|Informational|2021-06-15T04:44:...|20c6eec9-2d80-470...|Microsoft Azure S...|Microsoft.Sql/ser...|synapseworkspace-...|
|  4|e2958162-93d9-464...|    Delete SqlPools|Succeeded|Administrative|Informational|2021-06-15T04:44:...|20c6eec9-2d80-470...|techsup1000@gmail...|Microsoft.Synapse...|             new-grp|
+---+--------------------+-------------------

2

In [140]:
where_df = log_df.where(
    (F.col("Status") == "Succeeded") |
    (F.col("Operationname").isin(["Delete SQL database", "Delete SqlPools"]))
)
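where_df.limit(16).show()
# limit(18) caps the result at 18 rows, so this count is at most 18, not the total number of matches
where_df.limit(18).count()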

+---+--------------------+--------------------+---------+--------------+-------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| Id|       Correlationid|       Operationname|   Status| Eventcategory|        Level|                Time|        Subscription|    Eventinitiatedby|        Resourcetype|       Resourcegroup|
+---+--------------------+--------------------+---------+--------------+-------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|  1|66641e13-d19f-4ce...| Delete SQL database|Succeeded|Administrative|Informational|2021-06-15T04:44:...|20c6eec9-2d80-470...|Microsoft Azure S...|Microsoft.Sql/ser...|synapseworkspace-...|
|  2|66641e13-d19f-4ce...| Delete SQL database|  Started|Administrative|Informational|2021-06-15T04:44:...|20c6eec9-2d80-470...|Microsoft Azure S...|Microsoft.Sql/ser...|synapseworkspace-...|
|  3|66641e13-d19f-4ce...| Delete SQL da

18

## Selecting

In [141]:
part_log_df = log_df.select("Correlationid", "Operationname","Eventcategory")
part_log_df.show()
part_log_df.count()

+--------------------+--------------------+--------------+
|       Correlationid|       Operationname| Eventcategory|
+--------------------+--------------------+--------------+
|66641e13-d19f-4ce...| Delete SQL database|Administrative|
|66641e13-d19f-4ce...| Delete SQL database|Administrative|
|66641e13-d19f-4ce...| Delete SQL database|Administrative|
|e2958162-93d9-464...|     Delete SqlPools|Administrative|
|e2958162-93d9-464...|     Delete SqlPools|Administrative|
|e2958162-93d9-464...|     Delete SqlPools|Administrative|
|08cd2e19-477c-4ec...|Pause SQL Analyti...|Administrative|
|08cd2e19-477c-4ec...|Pause SQL Analyti...|Administrative|
|08cd2e19-477c-4ec...|Pause SQL Analyti...|Administrative|
|d2d9d7c4-2766-4e7...|Pause a Datawareh...|Administrative|
|d2d9d7c4-2766-4e7...|Pause a Datawareh...|Administrative|
|d2d9d7c4-2766-4e7...|Pause a Datawareh...|Administrative|
|1c735927-517e-470...| Create Pipeline Run|Administrative|
|1c735927-517e-470...| Create Pipeline Run|Administrativ

49

## Duplicates and Distincts

In [76]:
distinct_log_df = part_log_df.distinct()
distinct_log_df.show()
distinct_log_df.count()

+--------------------+--------------------+--------------+
|       Correlationid|       Operationname| Eventcategory|
+--------------------+--------------------+--------------+
|1c735927-517e-470...| Create Pipeline Run|Administrative|
|02c57e3c-6a26-4e7...|Create or update ...|Administrative|
|781dc10c-a838-46c...|Create or Update ...|Administrative|
|e2725ed8-301d-4ff...|List Storage Acco...|Administrative|
|638aec0d-c9f8-47a...|Create or Update ...|Administrative|
|977afba3-bc1e-4f4...| Create Pipeline Run|Administrative|
|66641e13-d19f-4ce...| Delete SQL database|Administrative|
|5b41a078-0ff1-40c...|   Delete Deployment|Administrative|
|02c57e3c-6a26-4e7...|   Create Deployment|Administrative|
|1593862e-3db3-475...|   Create Deployment|Administrative|
|072d5d31-b4b0-4bd...|Create or Update ...|Administrative|
|e2958162-93d9-464...|     Delete SqlPools|Administrative|
|6b8133f8-62b5-4e5...|   Delete Deployment|Administrative|
|08cd2e19-477c-4ec...|Pause SQL Analyti...|Administrativ

21

In [78]:
drop_log_df = part_log_df.dropDuplicates()
drop_log_df.show()
drop_log_df.count()

+--------------------+--------------------+--------------+
|       Correlationid|       Operationname| Eventcategory|
+--------------------+--------------------+--------------+
|1c735927-517e-470...| Create Pipeline Run|Administrative|
|02c57e3c-6a26-4e7...|Create or update ...|Administrative|
|781dc10c-a838-46c...|Create or Update ...|Administrative|
|e2725ed8-301d-4ff...|List Storage Acco...|Administrative|
|638aec0d-c9f8-47a...|Create or Update ...|Administrative|
|977afba3-bc1e-4f4...| Create Pipeline Run|Administrative|
|66641e13-d19f-4ce...| Delete SQL database|Administrative|
|5b41a078-0ff1-40c...|   Delete Deployment|Administrative|
|02c57e3c-6a26-4e7...|   Create Deployment|Administrative|
|1593862e-3db3-475...|   Create Deployment|Administrative|
|072d5d31-b4b0-4bd...|Create or Update ...|Administrative|
|e2958162-93d9-464...|     Delete SqlPools|Administrative|
|6b8133f8-62b5-4e5...|   Delete Deployment|Administrative|
|08cd2e19-477c-4ec...|Pause SQL Analyti...|Administrativ

21

In [79]:
duplicate_df = part_log_df.exceptAll(part_log_df.dropDuplicates())
duplicate_df.show()
duplicate_df.count()

+--------------------+--------------------+--------------+
|       Correlationid|       Operationname| Eventcategory|
+--------------------+--------------------+--------------+
|1c735927-517e-470...| Create Pipeline Run|Administrative|
|02c57e3c-6a26-4e7...|Create or update ...|Administrative|
|781dc10c-a838-46c...|Create or Update ...|Administrative|
|e2725ed8-301d-4ff...|List Storage Acco...|Administrative|
|638aec0d-c9f8-47a...|Create or Update ...|Administrative|
|977afba3-bc1e-4f4...| Create Pipeline Run|Administrative|
|66641e13-d19f-4ce...| Delete SQL database|Administrative|
|66641e13-d19f-4ce...| Delete SQL database|Administrative|
|02c57e3c-6a26-4e7...|   Create Deployment|Administrative|
|02c57e3c-6a26-4e7...|   Create Deployment|Administrative|
|1593862e-3db3-475...|   Create Deployment|Administrative|
|1593862e-3db3-475...|   Create Deployment|Administrative|
|072d5d31-b4b0-4bd...|Create or Update ...|Administrative|
|e2958162-93d9-464...|     Delete SqlPools|Administrativ

28
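
The count of 28 follows from multiset arithmetic: exceptAll removes one copy of a row for each copy in the other DataFrame, so subtracting the 21 distinct rows from the 49 original rows leaves 49 - 21 = 28 surplus copies. A plain-Python model using `collections.Counter` shows the same behaviour on a miniature, made-up dataset (the `corr-a` style ids are hypothetical).

```python
from collections import Counter

# Hypothetical miniature of part_log_df: 3 + 3 + 1 rows, 3 distinct values.
rows = (
    [("corr-a", "Delete SQL database")] * 3
    + [("corr-b", "Delete SqlPools")] * 3
    + [("corr-c", "Create Pipeline Run")]
)

all_rows = Counter(rows)
distinct_rows = Counter(set(rows))   # each distinct row once, like dropDuplicates()
surplus = all_rows - distinct_rows   # multiset difference, like exceptAll()

print(sum(all_rows.values()))   # 7
print(len(distinct_rows))       # 3
print(sum(surplus.values()))    # 4
```

A row that occurs only once in the original has no surplus copy, so it does not appear in the result at all.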

## Sorting (asc and desc)

In [148]:
# sort_asc_df = log_df.sort("Operationname")
sort_asc_df = log_df.sort(F.col("Operationname").asc())
sort_asc_df.show()

+---+--------------------+--------------------+---------+--------------+-------------+--------------------+--------------------+--------------------+--------------------+-------------+
| Id|       Correlationid|       Operationname|   Status| Eventcategory|        Level|                Time|        Subscription|    Eventinitiatedby|        Resourcetype|Resourcegroup|
+---+--------------------+--------------------+---------+--------------+-------------+--------------------+--------------------+--------------------+--------------------+-------------+
| 31|1593862e-3db3-475...|   Create Deployment|Succeeded|Administrative|Informational|2021-06-14T13:44:...|20c6eec9-2d80-470...|techsup1000@gmail...|Microsoft.Resourc...|      new-grp|
| 32|1593862e-3db3-475...|   Create Deployment|  Started|Administrative|Informational|2021-06-14T13:43:...|20c6eec9-2d80-470...|techsup1000@gmail...|Microsoft.Resourc...|      new-grp|
| 33|1593862e-3db3-475...|   Create Deployment| Accepted|Administrative|Inf

In [151]:
sort_asc_df = log_df.sort(F.col("Operationname").asc(),F.col("Status").asc())
sort_asc_df.show()

+---+--------------------+--------------------+---------+--------------+-------------+--------------------+--------------------+--------------------+--------------------+-------------+
| Id|       Correlationid|       Operationname|   Status| Eventcategory|        Level|                Time|        Subscription|    Eventinitiatedby|        Resourcetype|Resourcegroup|
+---+--------------------+--------------------+---------+--------------+-------------+--------------------+--------------------+--------------------+--------------------+-------------+
| 33|1593862e-3db3-475...|   Create Deployment| Accepted|Administrative|Informational|2021-06-14T13:43:...|20c6eec9-2d80-470...|techsup1000@gmail...|Microsoft.Resourc...|      new-grp|
| 40|02c57e3c-6a26-4e7...|   Create Deployment| Accepted|Administrative|Informational|2021-06-14T13:43:...|20c6eec9-2d80-470...|techsup1000@gmail...|Microsoft.Resourc...|      new-grp|
| 32|1593862e-3db3-475...|   Create Deployment|  Started|Administrative|Inf

In [152]:
# Operationname ascending, then Status descending within each operation
sorted_df = log_df.sort(F.col("Operationname").asc(), F.col("Status").desc())
sorted_df.show()

+---+--------------------+--------------------+---------+--------------+-------------+--------------------+--------------------+--------------------+--------------------+-------------+
| Id|       Correlationid|       Operationname|   Status| Eventcategory|        Level|                Time|        Subscription|    Eventinitiatedby|        Resourcetype|Resourcegroup|
+---+--------------------+--------------------+---------+--------------+-------------+--------------------+--------------------+--------------------+--------------------+-------------+
| 31|1593862e-3db3-475...|   Create Deployment|Succeeded|Administrative|Informational|2021-06-14T13:44:...|20c6eec9-2d80-470...|techsup1000@gmail...|Microsoft.Resourc...|      new-grp|
| 38|02c57e3c-6a26-4e7...|   Create Deployment|Succeeded|Administrative|Informational|2021-06-14T13:43:...|20c6eec9-2d80-470...|techsup1000@gmail...|Microsoft.Resourc...|      new-grp|
| 32|1593862e-3db3-475...|   Create Deployment|  Started|Administrative|Inf

In [153]:
# Operationname descending, then Status ascending within each operation
sorted_df = log_df.sort(F.col("Operationname").desc(), F.col("Status").asc())
sorted_df.show()

+---+--------------------+--------------------+---------+--------------+-------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| Id|       Correlationid|       Operationname|   Status| Eventcategory|        Level|                Time|        Subscription|    Eventinitiatedby|        Resourcetype|       Resourcegroup|
+---+--------------------+--------------------+---------+--------------+-------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| 12|d2d9d7c4-2766-4e7...|Pause a Datawareh...| Accepted|Administrative|Informational|2021-06-14T17:55:...|20c6eec9-2d80-470...|Microsoft Azure S...|Microsoft.Sql/ser...|synapseworkspace-...|
| 11|d2d9d7c4-2766-4e7...|Pause a Datawareh...|  Started|Administrative|Informational|2021-06-14T17:55:...|20c6eec9-2d80-470...|Microsoft Azure S...|Microsoft.Sql/ser...|synapseworkspace-...|
| 10|d2d9d7c4-2766-4e7...|Pause a Datawa
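
Sorting on two columns with different directions means the first column decides the order and the second only breaks ties. A plain-Python sketch of the "Operationname ascending, Status descending" case is shown below. It uses a handful of (Id, Operationname, Status) rows taken from the outputs above and relies on Python's stable sort; it models the ordering only and is not how Spark executes a sort.

```python
# (Id, Operationname, Status) rows taken from the outputs above, in arbitrary order.
rows = [
    (33, "Create Deployment", "Accepted"),
    (1, "Delete SQL database", "Succeeded"),
    (32, "Create Deployment", "Started"),
    (4, "Delete SqlPools", "Succeeded"),
    (2, "Delete SQL database", "Started"),
    (31, "Create Deployment", "Succeeded"),
]

# Python's sort is stable, so sort by the secondary key first (Status, descending),
# then by the primary key (Operationname, ascending).
by_status_desc = sorted(rows, key=lambda r: r[2], reverse=True)
result = sorted(by_status_desc, key=lambda r: r[1])

print([r[0] for r in result])  # [31, 32, 33, 1, 2, 4]
```

Note that "Delete SQL database" sorts before "Delete SqlPools" because uppercase letters compare lower than lowercase ones.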

## union and unionAll
### https://sparkbyexamples.com/spark/spark-dataframe-union-and-union-all/

In [None]:
head_log_df = log_df.sort("Operationname").limit(15)
head_log_df.show()
head_log_df.count()

In [174]:
# Despite its name, this takes the first 10 rows in ascending order, so it overlaps head_log_df
tail_log_df = log_df.sort("Operationname").limit(10)
tail_log_df.show()
tail_log_df.count()

+---+--------------------+-------------------+---------+--------------+-------------+--------------------+--------------------+--------------------+--------------------+-------------+
| Id|       Correlationid|      Operationname|   Status| Eventcategory|        Level|                Time|        Subscription|    Eventinitiatedby|        Resourcetype|Resourcegroup|
+---+--------------------+-------------------+---------+--------------+-------------+--------------------+--------------------+--------------------+--------------------+-------------+
| 40|02c57e3c-6a26-4e7...|  Create Deployment| Accepted|Administrative|Informational|2021-06-14T13:43:...|20c6eec9-2d80-470...|techsup1000@gmail...|Microsoft.Resourc...|      new-grp|
| 39|02c57e3c-6a26-4e7...|  Create Deployment|  Started|Administrative|Informational|2021-06-14T13:43:...|20c6eec9-2d80-470...|techsup1000@gmail...|Microsoft.Resourc...|      new-grp|
| 31|1593862e-3db3-475...|  Create Deployment|Succeeded|Administrative|Informati

10

In [175]:
top_10_df = head_log_df.union(tail_log_df)
top_10_df.show()
top_10_df.count()

+---+--------------------+--------------------+---------+--------------+-------------+--------------------+--------------------+--------------------+--------------------+-------------+
| Id|       Correlationid|       Operationname|   Status| Eventcategory|        Level|                Time|        Subscription|    Eventinitiatedby|        Resourcetype|Resourcegroup|
+---+--------------------+--------------------+---------+--------------+-------------+--------------------+--------------------+--------------------+--------------------+-------------+
| 31|1593862e-3db3-475...|   Create Deployment|Succeeded|Administrative|Informational|2021-06-14T13:44:...|20c6eec9-2d80-470...|techsup1000@gmail...|Microsoft.Resourc...|      new-grp|
| 32|1593862e-3db3-475...|   Create Deployment|  Started|Administrative|Informational|2021-06-14T13:43:...|20c6eec9-2d80-470...|techsup1000@gmail...|Microsoft.Resourc...|      new-grp|
| 33|1593862e-3db3-475...|   Create Deployment| Accepted|Administrative|Inf

25

In [166]:
top_10_df = head_log_df.unionAll(tail_log_df)
top_10_df.show()
top_10_df.count()

+---+--------------------+--------------------+---------+--------------+-------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| Id|       Correlationid|       Operationname|   Status| Eventcategory|        Level|                Time|        Subscription|    Eventinitiatedby|        Resourcetype|       Resourcegroup|
+---+--------------------+--------------------+---------+--------------+-------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|  1|66641e13-d19f-4ce...| Delete SQL database|Succeeded|Administrative|Informational|2021-06-15T04:44:...|20c6eec9-2d80-470...|Microsoft Azure S...|Microsoft.Sql/ser...|synapseworkspace-...|
|  2|66641e13-d19f-4ce...| Delete SQL database|  Started|Administrative|Informational|2021-06-15T04:44:...|20c6eec9-2d80-470...|Microsoft Azure S...|Microsoft.Sql/ser...|synapseworkspace-...|
|  3|66641e13-d19f-4ce...| Delete SQL da

25

In [176]:
top_10_df = head_log_df.union(tail_log_df).distinct()
top_10_df.show()
top_10_df.count()

+---+--------------------+--------------------+---------+--------------+-------------+--------------------+--------------------+--------------------+--------------------+-------------+
| Id|       Correlationid|       Operationname|   Status| Eventcategory|        Level|                Time|        Subscription|    Eventinitiatedby|        Resourcetype|Resourcegroup|
+---+--------------------+--------------------+---------+--------------+-------------+--------------------+--------------------+--------------------+--------------------+-------------+
| 40|02c57e3c-6a26-4e7...|   Create Deployment| Accepted|Administrative|Informational|2021-06-14T13:43:...|20c6eec9-2d80-470...|techsup1000@gmail...|Microsoft.Resourc...|      new-grp|
| 31|1593862e-3db3-475...|   Create Deployment|Succeeded|Administrative|Informational|2021-06-14T13:44:...|20c6eec9-2d80-470...|techsup1000@gmail...|Microsoft.Resourc...|      new-grp|
| 19|77152ae0-297f-4d1...|Create or Update ...|Succeeded|Administrative|Inf

15
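
The three counts above (25, 25 and 15) follow from two facts: unionAll is an alias for union, and both keep duplicates like SQL UNION ALL, so only an explicit distinct removes the overlap. The plain-Python model below is a sketch under the assumption that the 10-row DataFrame is a subset of the 15-row one, which the distinct count of 15 suggests; the integer ids are hypothetical stand-ins for rows.

```python
# Hypothetical stand-in for log_df sorted by Operationname: 49 row ids.
sorted_ids = list(range(1, 50))

head = sorted_ids[:15]   # like sort(...).limit(15)
tail = sorted_ids[:10]   # like sort(...).limit(10): the first 10 again, not the last 10

union_all = head + tail                 # union / unionAll keep duplicates
deduplicated = sorted(set(union_all))   # like union(...).distinct()

print(len(union_all))     # 25
print(len(deduplicated))  # 15
```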

# Untyped Transformations: agg, apply, col, drop, groupBy, join, select, withColumn, withColumnRenamed, crossJoin, register, sql

In [5]:
csv_file_path= 'data/random_data.csv'
random_df = spark.read.csv(
    csv_file_path,
    header=True,             
    inferSchema=True,
    sep=",",
    encoding="UTF-8",
    nullValue="NA",
    dateFormat="yyyy-MM-dd",
    timestampFormat="yyyy-MM-dd HH:mm:ss",
    mode="PERMISSIVE"
)
random_df.show()
random_df.printSchema()

+-----------------+--------------+--------------------+--------------------+-----------+--------------------+------------------+----+--------------------+-----------+--------+------------+
|             name|         phone|               email|             address|  postalZip|              region|           country|list|                text|numberrange|currency|alphanumeric|
+-----------------+--------------+--------------------+--------------------+-----------+--------------------+------------------+----+--------------------+-----------+--------+------------+
|    Aurelia Combs|(818) 147-3806|purus.gravida@icl...|951-7278 Risus. Road|      62744|           Innlandet|           Ukraine| 100|lorem, eget molli...|          1|  $40.00| BJO33IPL2AV|
|     Cairo Church|1-566-216-0485|velit.aliquam@pro...|   2397 Lacinia. Rd.|     741616|             Cartago|           Belgium| 100|Donec dignissim m...|          6|  $36.93| EDD86ZGW5PX|
|  Halee Christian|1-756-649-5978|orci.quis@protonm...|

# Aggregate Function: approx_count_distinct, count, first, mean, variance, stddev


# Collection: explode


# Column: asc, desc, cast


# Date and Time Function: months, unix_timestamp, from_unixtime


# Non Aggregate Function: broadcast, coalesce, col, lit


# Sorting Function: asc, desc


# String Function: split, regexp_replace

# UDF Function: udf


# DataFrameReader: text, parquet, load, textFile, json, option, format


# DataFrame Na Functions: na.fill, na.drop


# DataFrame Functions: printSchema, createOrReplaceTempView, cache, persist
