In [1]:
df = spark.read.format("json").load("/work/data/flight-data-2015-summary.json")
df

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint]

In [2]:
df.show(10)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
|    United States|          Singapore|    1|
|    United States|            Grenada|   62|
|       Costa Rica|      United States|  588|
|          Senegal|      United States|   40|
|          Moldova|      United States|    1|
+-----------------+-------------------+-----+
only showing top 10 rows



## Schemas

A schema defines the column names and types of a DataFrame.

In [3]:
df.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)



A schema is a StructType made up of fields (StructFields). Each field has a name, a type, and a Boolean flag that specifies whether the column can contain missing or null values. Users can also optionally attach metadata to a column.

In [4]:
df.schema

StructType(List(StructField(DEST_COUNTRY_NAME,StringType,true),StructField(ORIGIN_COUNTRY_NAME,StringType,true),StructField(count,LongType,true)))

Creating a schema:

In [5]:
from pyspark.sql.types import StructField, StructType, StringType, LongType

manual_schema = StructType([
  StructField("DEST_COUNTRY_NAME", StringType(), True),
  StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
  StructField("count", LongType(), False, metadata={"hello":"world"})
])

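Creating a dataframe using the created schema. File-based sources such as JSON treat every column as nullable when reading, so count is still reported as nullable = true even though the schema declares it non-nullable: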

In [8]:
df = spark.read.format("json").schema(manual_schema).load("/work/data/flight-data-2015-summary.json")
df.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)



## Columns and expressions

### Columns

Columns are expressed using the col and column functions, which are interchangeable:

In [9]:
from pyspark.sql.functions import col, column
col("someColumnName")
column("someColumnName")

Column<b'someColumnName'>

Columns are referenced from dataframes using the indexing operator

In [10]:
df["count"]

Column<b'count'>

Getting the list of column names from a dataframe:

In [11]:
df.columns

['DEST_COUNTRY_NAME', 'ORIGIN_COUNTRY_NAME', 'count']

### Expressions

An expression is a more general concept which involves columns and transformations applied to columns.

Different ways of creating expressions:

In [12]:
from pyspark.sql.functions import expr
print(expr("someCol - 5"))
print(col("someCol") - 5)
print(expr("someCol") - 5)

Column<b'(someCol - 5)'>
Column<b'(someCol - 5)'>
Column<b'(someCol - 5)'>


## Records and rows

1- In Spark, each row in a DataFrame is a single record. 

2- Spark represents this record as an object of type Row. 

3- Spark manipulates Row objects using column expressions in order to produce usable values. 

4- Row objects internally represent arrays of bytes

Getting the first row from a dataframe

In [14]:
row = df.first()
print(row)
# iterating through the row's values
for field in row:
    print(f'field: {field}, type: {type(field)}')
# accessing values by column name or by position
print('Value in the count column: ', row['count'])
print('Value in the second column: ', row[1])

Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15)
field: United States, type: <class 'str'>
field: Romania, type: <class 'str'>
field: 15, type: <class 'int'>
Value in the count column:  15
Value in the second column:  Romania


Creating a row:

In [15]:
from pyspark.sql import Row
my_row = Row("Hello", None, 1, False)
my_row

<Row('Hello', None, 1, False)>

## Transformations

Types of transformations:

 1- add rows or columns

 2- remove rows or columns

 3- transform a row into a column (or vice versa)

 4- change the order of rows based on the values in columns

### Creating DataFrames

The easy way:

In [20]:
df = spark.read.format("json").load("/work/data/flight-data-2015-summary.json")

The manual way:

In [21]:
from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, StringType, LongType
manual_schema = StructType([
  StructField("some", StringType(), True),
  StructField("col", StringType(), True),
  StructField("names", LongType(), False)
])
some_row = Row("Hello", None, 1)
manual_df = spark.createDataFrame([some_row], manual_schema)
manual_df.show()

+-----+----+-----+
| some| col|names|
+-----+----+-----+
|Hello|null|    1|
+-----+----+-----+



### select and selectExpr

Equivalent to the select part of an SQL query

Selecting a single column:

In [22]:
df.select("DEST_COUNTRY_NAME").show(2)

+-----------------+
|DEST_COUNTRY_NAME|
+-----------------+
|    United States|
|    United States|
+-----------------+
only showing top 2 rows



Selecting multiple columns:

In [23]:
df.select("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME").show(2)

+-----------------+-------------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|
+-----------------+-------------------+
|    United States|            Romania|
|    United States|            Croatia|
+-----------------+-------------------+
only showing top 2 rows



In [24]:
from pyspark.sql.functions import expr, col, column
df.select(
    expr("DEST_COUNTRY_NAME"),
    col("DEST_COUNTRY_NAME"),
    column("DEST_COUNTRY_NAME"))\
  .show(2)

+-----------------+-----------------+-----------------+
|DEST_COUNTRY_NAME|DEST_COUNTRY_NAME|DEST_COUNTRY_NAME|
+-----------------+-----------------+-----------------+
|    United States|    United States|    United States|
|    United States|    United States|    United States|
+-----------------+-----------------+-----------------+
only showing top 2 rows



Selecting using expressions:

In [25]:
df.select(expr("DEST_COUNTRY_NAME AS destination")).show(2)

+-------------+
|  destination|
+-------------+
|United States|
|United States|
+-------------+
only showing top 2 rows



selectExpr is shorthand for select with each argument wrapped in expr:

In [26]:
df.selectExpr("DEST_COUNTRY_NAME as newColumnName", "DEST_COUNTRY_NAME").show(2)

+-------------+-----------------+
|newColumnName|DEST_COUNTRY_NAME|
+-------------+-----------------+
|United States|    United States|
|United States|    United States|
+-------------+-----------------+
only showing top 2 rows



In [27]:
df.selectExpr(
  "*", # all original columns
  "(DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME) as withinCountry" # adding a new boolean column
).show(2)

+-----------------+-------------------+-----+-------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|withinCountry|
+-----------------+-------------------+-----+-------------+
|    United States|            Romania|   15|        false|
|    United States|            Croatia|    1|        false|
+-----------------+-------------------+-----+-------------+
only showing top 2 rows



Using aggregations:

In [28]:
df.selectExpr("avg(count)", "count(distinct(DEST_COUNTRY_NAME))").show()



+-----------+---------------------------------+
| avg(count)|count(DISTINCT DEST_COUNTRY_NAME)|
+-----------+---------------------------------+
|1770.765625|                              132|
+-----------+---------------------------------+



### Literals

In [30]:
from pyspark.sql.functions import lit
# adding a column of literals
df.select(expr("*"), lit(1).alias("One")).show(2)

+-----------------+-------------------+-----+---+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|One|
+-----------------+-------------------+-----+---+
|    United States|            Romania|   15|  1|
|    United States|            Croatia|    1|  1|
+-----------------+-------------------+-----+---+
only showing top 2 rows



### Adding columns

In [32]:
# adding a literal column again
df.withColumn("numberOne", lit(1)).show(2)

+-----------------+-------------------+-----+---------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|numberOne|
+-----------------+-------------------+-----+---------+
|    United States|            Romania|   15|        1|
|    United States|            Croatia|    1|        1|
+-----------------+-------------------+-----+---------+
only showing top 2 rows



In [33]:
# or a more complex expression
df.withColumn("withinCountry", expr("ORIGIN_COUNTRY_NAME == DEST_COUNTRY_NAME"))\
  .show(2)

+-----------------+-------------------+-----+-------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|withinCountry|
+-----------------+-------------------+-----+-------------+
|    United States|            Romania|   15|        false|
|    United States|            Croatia|    1|        false|
+-----------------+-------------------+-----+-------------+
only showing top 2 rows



### Renaming columns

In [34]:
df.withColumnRenamed("DEST_COUNTRY_NAME", "dest").columns

['dest', 'ORIGIN_COUNTRY_NAME', 'count']

### Column names with spaces

Let's add one column name with spaces:

In [36]:
df_with_space_col = df.withColumn(
    "This Long Column-Name", # nothing special here as the expected type of the parameter is str
    expr("ORIGIN_COUNTRY_NAME"))
df_with_space_col.show(2)

+-----------------+-------------------+-----+---------------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|This Long Column-Name|
+-----------------+-------------------+-----+---------------------+
|    United States|            Romania|   15|              Romania|
|    United States|            Croatia|    1|              Croatia|
+-----------------+-------------------+-----+---------------------+
only showing top 2 rows



Column names that contain spaces or other special characters must be enclosed in backticks (\`) when used in expressions:

In [37]:
df_with_space_col.selectExpr(
    "`This Long Column-Name`",
    "`This Long Column-Name` as `new col`")\
  .show(2)

+---------------------+-------+
|This Long Column-Name|new col|
+---------------------+-------+
|              Romania|Romania|
|              Croatia|Croatia|
+---------------------+-------+
only showing top 2 rows



### Removing columns

In [38]:
df.drop("ORIGIN_COUNTRY_NAME").columns

['DEST_COUNTRY_NAME', 'count']

### Changing column's type

In [39]:
df.withColumn("count2", col("count").cast("string"))

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: bigint, count2: string]

## Filtering rows

filter and where are aliases; both accept either a Column expression or a SQL string:

In [40]:
df.filter(col("count") < 2).show(2)
df.where("count < 2").show(2)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Croatia|    1|
|    United States|          Singapore|    1|
+-----------------+-------------------+-----+
only showing top 2 rows

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Croatia|    1|
|    United States|          Singapore|    1|
+-----------------+-------------------+-----+
only showing top 2 rows



Chaining filters (all conditions must hold, as with AND):

In [41]:
df.where(col("count") < 2).where(col("ORIGIN_COUNTRY_NAME") != "Croatia")\
  .show(2)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|          Singapore|    1|
|          Moldova|      United States|    1|
+-----------------+-------------------+-----+
only showing top 2 rows



### Getting unique rows

In [42]:
df.select("ORIGIN_COUNTRY_NAME").distinct().count()

125

### Random samples

In [44]:
seed = 5
with_replacement = False
fraction = 0.3
df.sample(with_replacement, fraction, seed).count()

86

### Appending rows

In [45]:
# creating a new dataframe with the same schema as df
from pyspark.sql import Row
schema = df.schema
rows = [
  Row("New Country", "Other Country", 5),
  Row("New Country 2", "Other Country 3", 1)
]
# createDataFrame accepts a local list of rows directly, so no RDD is needed
another_df = spark.createDataFrame(rows, schema)

Performing the union:

In [46]:
df.union(another_df).tail(4)

[Row(DEST_COUNTRY_NAME='Bonaire, Sint Eustatius, and Saba', ORIGIN_COUNTRY_NAME='United States', count=58),
 Row(DEST_COUNTRY_NAME='Greece', ORIGIN_COUNTRY_NAME='United States', count=30),
 Row(DEST_COUNTRY_NAME='New Country', ORIGIN_COUNTRY_NAME='Other Country', count=5),
 Row(DEST_COUNTRY_NAME='New Country 2', ORIGIN_COUNTRY_NAME='Other Country 3', count=1)]

### Sorting rows

Use sort or orderBy (they are equivalent). Both sort in ascending order by default.

In [47]:
df.orderBy("count", "DEST_COUNTRY_NAME").show(5)
df.sort(col("count"), col("DEST_COUNTRY_NAME")).show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|     Burkina Faso|      United States|    1|
|    Cote d'Ivoire|      United States|    1|
|           Cyprus|      United States|    1|
|         Djibouti|      United States|    1|
|        Indonesia|      United States|    1|
+-----------------+-------------------+-----+
only showing top 5 rows

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|     Burkina Faso|      United States|    1|
|    Cote d'Ivoire|      United States|    1|
|           Cyprus|      United States|    1|
|         Djibouti|      United States|    1|
|        Indonesia|      United States|    1|
+-----------------+-------------------+-----+
only showing top 5 rows



With count in descending order:

In [48]:
df.orderBy(col("count").desc(), "DEST_COUNTRY_NAME").show(5)

+-----------------+-------------------+------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| count|
+-----------------+-------------------+------+
|    United States|      United States|370002|
|    United States|             Canada|  8483|
|           Canada|      United States|  8399|
|    United States|             Mexico|  7187|
|           Mexico|      United States|  7140|
+-----------------+-------------------+------+
only showing top 5 rows



## Repartition and Coalesce

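These work the same way as with RDDs: repartition always performs a full shuffle and can increase or decrease the number of partitions, while coalesce merges existing partitions without a full shuffle and can only decrease it.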

In [49]:
df.rdd.getNumPartitions()

1

In [50]:
df.repartition(5).rdd.getNumPartitions()

5

### Repartition by column

In [51]:
repart_rdd = df.repartition(5, col("DEST_COUNTRY_NAME")).rdd
print('Number of partitions after repartition:', repart_rdd.getNumPartitions())

Number of partitions after repartition: 5


### Coalesce

In [52]:
df.repartition(5, col("DEST_COUNTRY_NAME")).coalesce(2).rdd.getNumPartitions()

2

### Replacing values

In [53]:
# loading a different data set
retail = spark.read.format("csv")\
  .option("header", "true")\
  .option("inferSchema", "true")\
  .load("/work/data/online-retail-dataset.csv")
retail.printSchema()

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: integer (nullable = true)
 |-- Country: string (nullable = true)



### Removing null values

Remove all rows with null values

In [55]:
print('Before:', retail.count())
print('After: ', retail.na.drop().count())

Before: 541909
After:  406829


Or applying na.drop only to a subset of columns:

In [56]:
retail.na.drop(subset=["CustomerID"]).count()

406829

### Filling null values

In [58]:
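# na.fill(0) replaces nulls in numeric columns only and never removes rows,
# so the row count is unchanged
retail.na.fill(0).count()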

541909