# Importing Libraries


In [87]:
from pyspark.sql import SparkSession, Row
from pyspark.storagelevel import StorageLevel
from pyspark.sql.functions import col, desc, asc
from pyspark.sql.types import StringType, StructField, StructType, IntegerType

# Creating a Spark Session

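> `SparkSession.Builder` is a class, whereas `SparkSession.builder` gives you an instance of that class, on which calls such as `appName()` and `getOrCreate()` are chained.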


In [88]:
spark = SparkSession.builder.appName("Spark Basics").getOrCreate()
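
To make the note about `Builder` and `builder` concrete, here is a toy model of the pattern in plain Python. `ToySession` and everything inside it are hypothetical names invented for this sketch; it is not how PySpark is implemented, and as a simplification the toy `getOrCreate()` ignores options passed once a session already exists.

```python
class ToySession:
    """Hypothetical stand-in for SparkSession, for illustration only."""

    _active = None  # the single shared session, created lazily

    class Builder:
        """Collects options, then creates or reuses a session."""

        def __init__(self):
            self._options = {}

        def appName(self, name):
            self._options["app.name"] = name
            return self  # returning self is what allows call chaining

        def getOrCreate(self):
            # Reuse the existing session if there is one, else create it.
            if ToySession._active is None:
                ToySession._active = ToySession(dict(self._options))
            return ToySession._active

    # `builder` (lowercase) is an instance of the `Builder` class.
    builder = Builder()

    def __init__(self, options):
        self.options = options


first = ToySession.builder.appName("Spark Basics").getOrCreate()
second = ToySession.builder.appName("Another name").getOrCreate()
print(first is second)            # both calls return the same session object
print(first.options["app.name"])  # the name given when it was first created
```

The point of the sketch is only the naming: the capitalized name is the class, the lowercase attribute is the object you actually call methods on.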

# Storage Levels in Spark

## Memory only Storage level

StorageLevel.MEMORY_ONLY is the default behavior of the RDD cache() method and stores the RDD or DataFrame as deserialized objects in JVM memory. When there is not enough memory available, some partitions are not cached; they are recomputed whenever they are needed.

This level takes more memory than the serialized levels. Unlike with an RDD, for a DataFrame it can be slower than MEMORY_AND_DISK, because the partitions that were not cached must be recomputed, and rebuilding the in-memory columnar representation of the underlying table is expensive.

## Serialize in Memory

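StorageLevel.MEMORY_ONLY_SER is the same as MEMORY_ONLY, except that it stores the RDD as serialized objects in JVM memory. It takes less memory (it is more space-efficient) than MEMORY_ONLY, at the cost of a few extra CPU cycles to deserialize the objects when they are read. In PySpark, stored objects are always serialized (pickled), so the `_SER` constants belong to the Scala/Java API and may not be available on `pyspark.StorageLevel`.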

## Memory only and Replicate

StorageLevel.MEMORY_ONLY_2 is the same as MEMORY_ONLY, but replicates each partition on two cluster nodes.

## Serialized in Memory and Replicate

StorageLevel.MEMORY_ONLY_SER_2 is the same as MEMORY_ONLY_SER, but replicates each partition on two cluster nodes.

## Memory and Disk Storage level

StorageLevel.MEMORY_AND_DISK is the default storage level for a cached DataFrame or Dataset. The DataFrame is stored in JVM memory as deserialized objects. When the required storage exceeds the available memory, the excess partitions are written to disk and read back from disk when needed. This is slower than a purely in-memory level because of the disk I/O involved.

## Serialize in Memory and Disk

StorageLevel.MEMORY_AND_DISK_SER is the same as MEMORY_AND_DISK, except that it stores the DataFrame as serialized objects, both in memory and on disk when memory runs out.

## Memory, Disk and Replicate

StorageLevel.MEMORY_AND_DISK_2 is the same as MEMORY_AND_DISK, but replicates each partition on two cluster nodes.

## Serialize in Memory, Disk and Replicate

StorageLevel.MEMORY_AND_DISK_SER_2 is the same as MEMORY_AND_DISK_SER, but replicates each partition on two cluster nodes.

## Disk only storage level

With StorageLevel.DISK_ONLY, the DataFrame is stored only on disk, and the CPU time is high because every read involves disk I/O.

## Disk only and Replicate

StorageLevel.DISK_ONLY_2 is the same as DISK_ONLY, but replicates each partition on two cluster nodes.

| Storage Level       | Space used | CPU time | In memory | On-disk | Serialized | Recompute some partitions |
| ------------------- | ---------- | -------- | --------- | ------- | ---------- | ------------------------- |
| MEMORY_ONLY         | High       | Low      | Y         | N       | N          | Y                         |
| MEMORY_ONLY_SER     | Low        | High     | Y         | N       | Y          | Y                         |
| MEMORY_AND_DISK     | High       | Medium   | Some      | Some    | Some       | N                         |
| MEMORY_AND_DISK_SER | Low        | High     | Some      | Some    | Y          | N                         |
| DISK_ONLY           | Low        | High     | N         | Y       | Y          | N                         |
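
The difference between "recompute", "spill to disk" and "disk only" in the table can be sketched with a toy placement model in plain Python. `ToyLevel`, `place_partitions`, the partition sizes and the memory budget are all hypothetical; real Spark tracks memory and evicts blocks in a more involved way, so treat this as a mental model rather than Spark's algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyLevel:
    """Hypothetical, simplified storage level: only the two flags we need."""
    name: str
    use_memory: bool
    use_disk: bool


MEMORY_ONLY = ToyLevel("MEMORY_ONLY", use_memory=True, use_disk=False)
MEMORY_AND_DISK = ToyLevel("MEMORY_AND_DISK", use_memory=True, use_disk=True)
DISK_ONLY = ToyLevel("DISK_ONLY", use_memory=False, use_disk=True)


def place_partitions(level, partition_sizes, memory_budget):
    """Decide where each partition ends up under the given toy level."""
    placement = {"memory": [], "disk": [], "recompute": []}
    free = memory_budget
    for index, size in enumerate(partition_sizes):
        if level.use_memory and size <= free:
            placement["memory"].append(index)     # fits in memory
            free -= size
        elif level.use_disk:
            placement["disk"].append(index)       # spilled or stored on disk
        else:
            placement["recompute"].append(index)  # not stored at all
    return placement


sizes = [4, 4, 4]  # made-up partition sizes, arbitrary units
for level in (MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY):
    print(level.name, place_partitions(level, sizes, memory_budget=8))
```

With room for only two of the three partitions, the toy MEMORY_ONLY level leaves the third to be recomputed, while MEMORY_AND_DISK sends it to disk instead.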


# Reading a Dataset


## Reading the people.json file


In [89]:
people_df = spark.read.json("./datasets/people.json")

## Reading Files with custom Schema


In [90]:
data_schema = [
    StructField(name="age", dataType=IntegerType(), nullable=True),
    StructField(name="name", dataType=StringType(), nullable=True),
]
final_structure = StructType(fields=data_schema)

people_df_custom_schema = spark.read.json(
    "./datasets/people.json", schema=final_structure
)

# Dataframe Methods


## DataFrame.show() → None

Parameters

1. n: int, optional. Number of rows to show.
2. truncate: bool or int, optional. If set to True (the default), strings longer than 20 characters are truncated. If set to a number greater than one, long strings are truncated to that length and cells are right-aligned.
3. vertical: bool, optional. If set to True, output rows are printed vertically (one line per column value).
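
The `truncate` rule can be sketched as a small plain-Python helper. `truncate_cell` is a hypothetical function written for this note; the exact cut-off (keeping three characters for a trailing `...`) is an assumption based on how Spark's console output usually looks, not something stated above.

```python
def truncate_cell(value, truncate=True):
    """Model of how one cell might be shortened by show(truncate=...)."""
    text = str(value)
    if truncate is True:
        limit = 20             # default width when truncate=True
    elif truncate is False:
        limit = 0              # 0 means "do not truncate"
    else:
        limit = int(truncate)  # explicit maximum width

    if limit > 0 and len(text) > limit:
        if limit < 4:
            return text[:limit]           # too short to fit an ellipsis
        return text[: limit - 3] + "..."  # reserve three chars for "..."
    return text


print(truncate_cell("Michael"))                # short values are unchanged
print(truncate_cell("a" * 25))                 # cut to 20 characters in total
print(truncate_cell("a" * 25, truncate=False)) # full value
print(truncate_cell("abcdefghij", truncate=5))
```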


In [91]:
people_df.show()

+----+-------+
| age|   name|
+----+-------+
|NULL|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [92]:
people_df.show(truncate=False, n=2, vertical=True)

-RECORD 0-------
 age  | NULL    
 name | Michael 
-RECORD 1-------
 age  | 30      
 name | Andy    
only showing top 2 rows



## DataFrame.printSchema() → None

1. level: int, optional, default None. How many levels to print for nested schemas.


In [93]:
people_df.printSchema()

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)



In [94]:
nested_df = spark.createDataFrame([(1, (2, 2))], ["a", "b"])

In [95]:
nested_df.printSchema(1)
nested_df.printSchema(2)
del nested_df

root
 |-- a: long (nullable = true)
 |-- b: struct (nullable = true)

root
 |-- a: long (nullable = true)
 |-- b: struct (nullable = true)
 |    |-- _1: long (nullable = true)
 |    |-- _2: long (nullable = true)



## DataFrame.describe() → pyspark.sql.dataframe.DataFrame

cols: str or list, optional. Column name or list of column names to describe (default: all columns).


In [96]:
people_df.describe().show()

+-------+------------------+-------+
|summary|               age|   name|
+-------+------------------+-------+
|  count|                 2|      3|
|   mean|              24.5|   NULL|
| stddev|7.7781745930520225|   NULL|
|    min|                19|   Andy|
|    max|                30|Michael|
+-------+------------------+-------+



In [97]:
people_df.describe(["age"]).show()

+-------+------------------+
|summary|               age|
+-------+------------------+
|  count|                 2|
|   mean|              24.5|
| stddev|7.7781745930520225|
|    min|                19|
|    max|                30|
+-------+------------------+
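
As a cross-check of the `age` column above, the same summary can be recomputed in plain Python. This is only a sketch: the list of ages is copied by hand from the `show()` output, nulls are skipped as `count` suggests, and the standard deviation is taken to be the sample standard deviation (dividing by n - 1), which is the assumption that reproduces the value shown.

```python
import statistics

ages = [None, 30, 19]  # the "age" column, copied from the output above

# describe() reports count = 2, so nulls are excluded from the statistics.
non_null = [age for age in ages if age is not None]

summary = {
    "count": len(non_null),
    "mean": statistics.mean(non_null),
    "stddev": statistics.stdev(non_null),  # sample standard deviation
    "min": min(non_null),
    "max": max(non_null),
}

for key, value in summary.items():
    print(f"{key:>6}: {value}")
```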



## DataFrame.agg() → pyspark.sql.dataframe.DataFrame

Aggregate on the entire DataFrame without groups (shorthand for df.groupBy().agg()).

### Parameters

    exprs: Column or dict of key and value strings. Columns or expressions to aggregate the DataFrame by.

### Returns

    DataFrame: Aggregated DataFrame.


In [98]:
people_df.agg({"age": "max"}).show()
# people_df.groupBy().agg({"age": "max"}).show() # same

+--------+
|max(age)|
+--------+
|      30|
+--------+



## DataFrame.alias(alias: str) → pyspark.sql.dataframe.DataFrame

### Parameters

    alias str: an alias name to be set for the DataFrame.

### Returns

    DataFrame: Aliased DataFrame.


In [99]:
people_df_alias_1 = people_df.alias("people_df_1")
people_df_alias_2 = people_df.alias("people_df_2")

people_df_alias_1.join(
    people_df_alias_2, col("people_df_1.name") == col("people_df_2.name"), how="inner"
).show()

+----+-------+----+-------+
| age|   name| age|   name|
+----+-------+----+-------+
|NULL|Michael|NULL|Michael|
|  30|   Andy|  30|   Andy|
|  19| Justin|  19| Justin|
+----+-------+----+-------+



## DataFrame.approxQuantile() → Union[List[float], List[List[float]]]

Calculates the approximate quantiles of numerical columns of a DataFrame.

The result of this algorithm has the following deterministic bound: if the DataFrame has N elements and we request the quantile at probability p up to error err, then the algorithm returns a sample x from the DataFrame such that the exact rank of x is close to (p * N). More precisely,

`floor((p - err) * N) <= rank(x) <= ceil((p + err) * N)`.

This method implements a variation of the Greenwald-Khanna algorithm (with some speed optimizations). The algorithm was first presented in [Space-efficient Online Computation of Quantile Summaries](https://doi.org/10.1145/375663.375670) by Greenwald and Khanna.

### Parameters

    1. col: str, tuple or list. Can be a single column name, or a list of names for multiple columns.

    2. probabilities: list or tuple. A list of quantile probabilities. Each number must belong to [0, 1]. For example, 0 is the minimum, 0.5 is the median, and 1 is the maximum.

    3. relativeError: float. The relative target precision to achieve (>= 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.

### Returns

    list
    the approximate quantiles at the given probabilities.

    If the input col is a string, the output is a list of floats.

    If the input col is a list or tuple of strings, the output is also a
    list, but each element in it is a list of floats, i.e., the output is a list of list of floats.


In [100]:
people_df.approxQuantile(col="age", probabilities=[0.5], relativeError=0.25)

[19.0]
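
To see why `19.0` is an acceptable answer for the median here, the rank bound from the description can be evaluated by hand. The helpers `rank_bounds` and `rank_of` are hypothetical names for this sketch, and it assumes that the null age is ignored, so N = 2.

```python
import math


def rank_bounds(p, err, n):
    """Allowed rank interval: floor((p - err) * n) .. ceil((p + err) * n)."""
    lower = math.floor((p - err) * n)
    upper = math.ceil((p + err) * n)
    return lower, upper


def rank_of(x, values):
    """1-based rank of x, counted as the number of values <= x."""
    return sum(1 for v in values if v <= x)


ages = [19, 30]  # non-null ages from people.json
lower, upper = rank_bounds(p=0.5, err=0.25, n=len(ages))

for candidate in ages:
    rank = rank_of(candidate, ages)
    print(candidate, rank, lower <= rank <= upper)
```

With such a loose `relativeError` and only two values, both 19 and 30 fall inside the allowed rank interval, so either would satisfy the bound.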

## DataFrame.cache() → pyspark.sql.dataframe.DataFrame

Persists the DataFrame with the default storage level (MEMORY_AND_DISK_DESER in Spark 3.0 and later, the same default that persist() uses).

### Returns

    DataFrame: Cached DataFrame.


In [101]:
people_df.cache()

DataFrame[age: bigint, name: string]

## DataFrame.persist() → pyspark.sql.dataframe.DataFrame

### Parameters

    storageLevel StorageLevel: Storage level to set for persistence. Default is MEMORY_AND_DISK_DESER.

### Returns

    DataFrame: Persisted DataFrame.


In [102]:
people_df.persist(StorageLevel.MEMORY_AND_DISK)

DataFrame[age: bigint, name: string]

## DataFrame.checkpoint() → pyspark.sql.dataframe.DataFrame

Returns a checkpointed version of this DataFrame. Checkpointing can be used to truncate the logical plan of this DataFrame, which is especially useful in iterative algorithms where the plan may grow exponentially. It will be saved to files inside the checkpoint directory set with SparkContext.setCheckpointDir().

### Parameters

    eager bool, optional, default True: Whether to checkpoint this DataFrame immediately.

### Returns

    DataFrame: Checkpointed DataFrame.


## DataFrame.coalesce(numPartitions: int) → pyspark.sql.dataframe.DataFrame

Similar to coalesce defined on an RDD, this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions. If a larger number of partitions is requested, it will stay at the current number of partitions.

However, if you’re doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition(). This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
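
The "each new partition claims 10 of the current partitions" idea can be sketched in plain Python. `coalesce_groups` is a hypothetical helper that assigns contiguous ranges of parent partitions; Spark's actual coalescer also considers data locality, so this only models the narrow-dependency grouping, not the real assignment.

```python
def coalesce_groups(num_current, num_target):
    """Group parent partition indices into at most num_target new partitions."""
    if num_target >= num_current:
        # Asking for more partitions than exist leaves the count unchanged.
        return [[index] for index in range(num_current)]

    groups = []
    for new_index in range(num_target):
        start = new_index * num_current // num_target
        end = (new_index + 1) * num_current // num_target
        groups.append(list(range(start, end)))  # no data moves between groups
    return groups


groups = coalesce_groups(1000, 100)
print(len(groups), {len(group) for group in groups})
print(len(coalesce_groups(3, 8)))  # a larger target keeps the 3 partitions
print(coalesce_groups(5, 1))       # drastic coalesce: one partition has it all
```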

### Parameters

    numPartitions int
    specify the target number of partitions

### Returns

    DataFrame


In [103]:
people_df.coalesce(1).rdd.getNumPartitions()

1

## DataFrame.colRegex(colName: str) → pyspark.sql.column.Column

### Parameters

    colName str
    string, column name specified as a regex.

### Returns

    Column


In [104]:
people_df.select(people_df.colRegex("`(name)?+.+`")).show()

+----+
| age|
+----+
|NULL|
|  30|
|  19|
+----+
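
The pattern `(name)?+.+` is terse: the possessive `?+` consumes `name` if it is there and never gives it back, and `.+` then needs at least one more character, so a column called exactly `name` cannot match. The sketch below reproduces that selection in plain Python. Because Python's `re` only gained possessive quantifiers in 3.11, it uses a negative lookahead instead, and it assumes the regex must match the whole column name.

```python
import re

columns = ["age", "name"]  # the columns of people_df

# Stand-in for the possessive pattern: any non-empty name except exactly "name".
not_name = re.compile(r"(?!name$).+")

selected = [column for column in columns if not_name.fullmatch(column)]
print(selected)
```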



## DataFrame.collect() → List[pyspark.sql.types.Row]

Returns all the records as a list of Row.

### Returns

    list
    List of rows.


In [105]:
people_df.collect()

[Row(age=None, name='Michael'),
 Row(age=30, name='Andy'),
 Row(age=19, name='Justin')]

## DataFrame.corr() → float

Calculates the correlation of two columns of a DataFrame as a double value. Currently only supports the Pearson Correlation Coefficient. DataFrame.corr() and DataFrameStatFunctions.corr() are aliases of each other.

### Parameters

    col1 str
    The name of the first column

    col2 str
    The name of the second column

    method str, optional
    The correlation method. Currently only supports “pearson”

### Returns

    float
    Pearson Correlation Coefficient of two columns.


In [106]:
# Left commented out: corr() needs two numeric columns, and "name" is a string column.
# people_df.corr("age", "name")
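
Since `people_df` has only one numeric column, here is what the Pearson coefficient computes, sketched in plain Python on made-up numbers. The `pearson` helper and both sample lists are hypothetical and only illustrate the formula; they are not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [1, 2, 3, 4]          # made-up values
rising = [2, 4, 6, 8]      # grows with xs
falling = [8, 6, 4, 2]     # shrinks as xs grows

print(pearson(xs, rising))
print(pearson(xs, falling))
```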

## DataFrame.createGlobalTempView(name: str) → None

Creates a global temporary view with this DataFrame.
The lifetime of this temporary view is tied to this `Spark application`. Throws TempTableAlreadyExistsException if the view name already exists in the catalog.

### Parameters

    name str
    Name of the view.


In [107]:
# people_df.createGlobalTempView("people_df_global_temp_view")
spark.sql("select * from global_temp.people_df_global_temp_view").show()

+----+-------+
| age|   name|
+----+-------+
|NULL|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [108]:
people_df.createOrReplaceGlobalTempView("people_df_global_temp_view")
spark.sql("select * from global_temp.people_df_global_temp_view").show()

+----+-------+
| age|   name|
+----+-------+
|NULL|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



## DataFrame.createTempView(name: str) → None

Creates a local temporary view with this DataFrame.

The lifetime of this temporary view is tied to the `SparkSession` that was used to create this DataFrame. Throws TempTableAlreadyExistsException if the view name already exists in the catalog.

### Parameters

    name str
    Name of the view.


In [109]:
# people_df.createTempView("people_df_temp_view")
spark.sql("select * from people_df_temp_view").show()

+----+-------+
| age|   name|
+----+-------+
|NULL|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [110]:
people_df.createOrReplaceTempView("people_df_temp_view")
spark.sql("select * from people_df_temp_view").show()

+----+-------+
| age|   name|
+----+-------+
|NULL|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



## DataFrame.crossJoin(other: pyspark.sql.dataframe.DataFrame) → pyspark.sql.dataframe.DataFrame

Returns the cartesian product with another DataFrame.

### Parameters

    other DataFrame
    Right side of the cartesian product.

### Returns

    DataFrame
    Joined DataFrame.


In [113]:
people_df.crossJoin(people_df.select("age")).show()

+----+-------+----+
| age|   name| age|
+----+-------+----+
|NULL|Michael|NULL|
|NULL|Michael|  30|
|NULL|Michael|  19|
|  30|   Andy|NULL|
|  30|   Andy|  30|
|  30|   Andy|  19|
|  19| Justin|NULL|
|  19| Justin|  30|
|  19| Justin|  19|
+----+-------+----+
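
The nine rows above are simply every left row paired with every right row. A plain-Python sketch with `itertools.product` shows the same pairing; the tuples are copied by hand from the output, and the row order of a real Spark cross join is not something to rely on.

```python
from itertools import product

# Rows of people_df as (age, name) tuples, copied from the output above.
people = [(None, "Michael"), (30, "Andy"), (19, "Justin")]

# Right side: people_df.select("age"), i.e. one-column rows.
ages = [(age,) for age, _name in people]

# Cartesian product: concatenate each left row with each right row.
cross = [left + right for left, right in product(people, ages)]

print(len(cross))
for row in cross:
    print(row)
```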



# DataFrame Attributes


## columns

Retrieves the names of all columns in the DataFrame as a list.


In [None]:
people_df.columns

['age', 'name']

## dtypes

Returns all column names and their data types as a list.


In [None]:
people_df.dtypes

[('age', 'bigint'), ('name', 'string')]

## isStreaming

Returns True if this DataFrame contains one or more sources that continuously return data as it arrives.


In [None]:
people_df.isStreaming

False

## na

Returns a DataFrameNaFunctions for handling missing values.

> class pyspark.sql.DataFrameNaFunctions(df: pyspark.sql.dataframe.DataFrame)

1. drop([how, thresh, subset]): Returns a new DataFrame omitting rows with null values.
   1. how: str, optional. ‘any’ or ‘all’. If ‘any’, drop a row if it contains any nulls. If ‘all’, drop a row only if all its values are null.
   2. thresh: int, optional, default None. If specified, drop rows that have fewer than thresh non-null values. This overrides the how parameter.
   3. subset: str, tuple or list, optional. List of column names to consider.
2. fill(value[, subset]): Replaces null values. DataFrame.fillna() is an alias for na.fill().

   1. value: int, float, string, bool or dict. Value to replace null values with. If the value is a dict, then subset is ignored and value must be a mapping from column name (string) to replacement value. The replacement value must be an int, float, boolean, or string.

   2. subset: str, tuple or list, optional. List of column names to consider. Columns specified in subset that do not have matching data types are ignored. For example, if value is a string, and subset contains a non-string column, then the non-string column is simply ignored.

3. replace(to_replace[, value, subset]): Returns a new DataFrame replacing a value with another value.

   1. to_replace: bool, int, float, string, list or dict. Value to be replaced. If the value is a dict, then value is ignored or can be omitted, and to_replace must be a mapping between a value and a replacement.

   2. value: bool, int, float, string or None, optional. The replacement value must be a bool, int, float, string or None. If value is a list, value should be of the same length and type as to_replace. If value is a scalar and to_replace is a sequence, then value is used as a replacement for each item in to_replace.

   3. subset: list, optional. List of column names to consider. Columns specified in subset that do not have matching data types are ignored. For example, if value is a string, and subset contains a non-string column, then the non-string column is simply ignored.
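
How `how`, `thresh` and `subset` interact in `drop` is easier to see on a small model. The sketch below applies the rules described above to rows held as plain dictionaries; `drop_nulls` is a hypothetical helper, not part of PySpark, and the rows are copied by hand from people.json.

```python
def drop_nulls(rows, how="any", thresh=None, subset=None):
    """Plain-Python model of the na.drop() rules for rows stored as dicts."""
    kept = []
    for row in rows:
        names = list(subset) if subset is not None else list(row)
        non_null = sum(1 for name in names if row[name] is not None)
        if thresh is not None:
            keep = non_null >= thresh      # thresh takes priority over how
        elif how == "any":
            keep = non_null == len(names)  # drop if any value is null
        elif how == "all":
            keep = non_null > 0            # drop only if every value is null
        else:
            raise ValueError("how must be 'any' or 'all'")
        if keep:
            kept.append(row)
    return kept


people = [
    {"age": None, "name": "Michael"},
    {"age": 30, "name": "Andy"},
    {"age": 19, "name": "Justin"},
]

print([row["name"] for row in drop_nulls(people, how="any")])
print([row["name"] for row in drop_nulls(people, how="all")])
print([row["name"] for row in drop_nulls(people, thresh=2)])
print([row["name"] for row in drop_nulls(people, subset=["name"])])
```

Only the Michael row has a null, so it is the one that disappears under `how="any"` or `thresh=2`, and it survives when only the `name` column is considered.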


In [None]:
people_df.na.fill(50).show()

+---+-------+
|age|   name|
+---+-------+
| 50|Michael|
| 30|   Andy|
| 19| Justin|
+---+-------+



## schema

Returns the schema of this DataFrame as a pyspark.sql.types.StructType.


In [None]:
people_df.schema

StructType([StructField('age', LongType(), True), StructField('name', StringType(), True)])