## PySpark, pandas and Koalas comparison

This document compares how to manipulate a DataFrame with PySpark, pandas and Koalas.









Koalas is a project by Databricks, initially released in 2019, which aims to bridge the gap for those who know pandas and want to leverage the power of distributed computing, but who do not know PySpark. The syntax for most operations is the same as for pandas, so there is less need to learn PySpark from scratch.

The Koalas project is still in development and not all pandas or PySpark functionality is available yet. As of June 2021, new versions of the package are being released on a regular basis.



Koalas is not installed on CDSW by default, so it must be installed from [Artifactory](http://np2rvlapxx507/DAP_CATS/guidance/-/blob/master/Artifactory.md); please see the [Koalas setup instructions](http://np2rvlapxx507/DAP_CATS/guidance/-/blob/master/koalas_setup.md) for more details.

This document is not intended as a recommendation to learn one implementation over another.

Throughout this document, to clearly distinguish between different types of DataFrames (DFs), pandas DFs are prefixed with `pdf`, Koalas DFs with `kdf`, and PySpark DFs with `sdf`. Derived scalar values are suffixed `_pandas`, `_koalas` and `_spark`. Prefixing or suffixing variable names with their data type in this way is generally not necessary when writing your own code; instead, choose short sensible names for your DFs. For more information on variable naming, please consult [Core Programming: Naming Variables](https://best-practice-and-impact.github.io/qa-of-code-guidance/core_programming.html#naming-variables) from the [QA of Code for Analysis and Research](https://best-practice-and-impact.github.io/qa-of-code-guidance/intro.html).

## Reading, creating and converting data

### Import relevant packages and setup

Load packages into CDSW Python session.

````{tabs}
```{code-tab} python pandas
import pandas as pd
```
```{code-tab} python PySpark
from pyspark.sql import SparkSession, functions as F
```
```{code-tab} python Koalas
import databricks.koalas as ks # Note the databricks prefix here for Koalas
```
````

If you haven't set the environment variables `ARROW_PRE_0_15_IPC_FORMAT` and `PYARROW_IGNORE_TIMEZONE` to `1` then you may get a warning when importing Koalas.
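
One way to avoid the warning is to set both variables from Python before Koalas is imported. The following is a minimal sketch; it overwrites any value the variables already hold, and setting them elsewhere (for example, in your session configuration) would work equally well.

```python
import os

# Set both variables before Koalas (and therefore PyArrow) is imported
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"

# Koalas can then be imported as usual:
# import databricks.koalas as ks
```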

Note that Koalas will use an existing Spark session if one exists, otherwise it will create one with the session name `Koalas` the first time a Koalas DataFrame is loaded.

````{tabs}
```{code-tab} python Koalas
# Start a Spark session; Koalas will reuse it
from pyspark.sql import SparkSession

spark = (SparkSession.builder.master("local[2]")
         .appName("comparison")
         .getOrCreate())
```
```{code-tab} python PySpark
spark = (SparkSession.builder.master("local[2]")
         .appName("comparison")
         .getOrCreate())
```
````

### Create DataFrame

Create pandas, Koalas or PySpark DataFrame based on rows and columns
````{tabs}
```{code-tab} python pandas
rows = [["SW1V", 23153], ["PO15", 22526], ["NP10", 23424]]
columns = ["postcode_district", "population"]

pdf = pd.DataFrame(rows, columns=columns)
```
```{code-tab} python PySpark
rows = [["SW1V", 23153], ["PO15", 22526], ["NP10", 23424]]
columns = ["postcode_district", "population"]

sdf = spark.createDataFrame(rows, columns)
```
```{code-tab} python Koalas
rows = [["SW1V", 23153], ["PO15", 22526], ["NP10", 23424]]
columns = ["postcode_district", "population"]

kdf = ks.DataFrame(rows, columns=columns)
```
````

### Convert DataFrame to pandas

Convert from specified DataFrame type to pandas
````{tabs}
```{code-tab} python Koalas
pdf = kdf.to_pandas()
```
```{code-tab} python PySpark
pdf = sdf.toPandas()
```
````

### Convert DataFrame to Koalas

Convert from specified DataFrame type to Koalas
````{tabs}
```{code-tab} python pandas
kdf = ks.from_pandas(pdf)

```
```{code-tab} python PySpark
kdf = ks.DataFrame(sdf)
```
````

### Convert DataFrame to PySpark

Convert from specified DataFrame type to PySpark
````{tabs}
```{code-tab} python pandas
sdf = spark.createDataFrame(pdf)
```
```{code-tab} python Koalas
sdf = kdf.to_spark()
```
````

### Read CSV from HDFS

Read in a CSV from HDFS.

For pandas you can read and write directly to HDFS with [Pydoop](http://np2rvlapxx507/DAP_CATS/troubleshooting/tip-of-the-week/-/blob/master/tip_12_pydoop.ipynb) or go via PySpark.

```python
file_path = "/training/animal_rescue.csv"
```

````{tabs}
```{code-tab} python pandas
# Easiest to read in with Koalas/PySpark then convert
# Be careful of file size: all the data must fit on the driver (CDSW or notebook session)
sdf = spark.read.csv(file_path, header=True, inferSchema=True)
pdf = sdf.toPandas()
```
```{code-tab} python Koalas
kdf = ks.read_csv(file_path, header=0)
```
```{code-tab} python PySpark
sdf = spark.read.csv(file_path, header=True, inferSchema=True)
```
````

## Selecting and renaming columns

### Rename one column

Rename just one column

```python
old_col = "IncidentNumber"
new_col = "incident_number"
```

````{tabs}
```{code-tab} python pandas
pdf = pdf.rename({old_col: new_col}, axis=1)
```
```{code-tab} python Koalas
kdf = kdf.rename({old_col: new_col}, axis=1)
```
```{code-tab} python PySpark
sdf = sdf.withColumnRenamed(old_col, new_col)
```
````

### Rename multiple columns

Use a dictionary to rename multiple columns

```python
columns_dict = {"HourlyNotionalCost(£)": "total_cost",
                "AnimalGroupParent": "animal_group",
                "CalYear": "cal_year",
                "PumpHoursTotal": "job_hours",
                "PumpCount": "engine_count",
                "OriginofCall": "origin_of_call",
                "PropertyType": "property_type",
                "PropertyCategory": "property_category"
}
```

````{tabs}
```{code-tab} python pandas
pdf = pdf.rename(columns_dict, axis=1)
```
```{code-tab} python Koalas
kdf = kdf.rename(columns_dict, axis=1)
```
```{code-tab} python PySpark
# withColumnRenamed takes one column at a time, so loop over the dictionary
for old_col, new_col in columns_dict.items():
    sdf = sdf.withColumnRenamed(old_col, new_col)
```
````

### Select columns

Select a subset of columns
````{tabs}
```{code-tab} python pandas
columns = ["incident_number"] + list(columns_dict.values())

pdf = pdf[columns]
```
```{code-tab} python Koalas
columns = ["incident_number"] + list(columns_dict.values())

kdf = kdf[columns]
```
```{code-tab} python PySpark
columns = ["incident_number"] + list(columns_dict.values())

sdf = sdf.select(columns)
```
````

### Drop columns

Remove columns

```python
columns = ["property_type", "property_category"]
```

````{tabs}
```{code-tab} python pandas
pdf.drop(columns, axis=1, inplace=True)
```
```{code-tab} python Koalas
# In Koalas, axis=1 is the default value
kdf = kdf.drop(columns)
```
```{code-tab} python PySpark
# If using a list, need to unpack the list with *
sdf = sdf.drop(*columns)
# Alternative: unpacking is not needed if passing the column names directly
sdf = sdf.drop(columns[0], columns[1])
```
````

### Derive new column

Calculate a new column based on other columns. This will overwrite an existing column if one already exists with the same name.

```python
new_col = "incident_duration"
numerator_col = "job_hours"
denominator_col = "engine_count"
```

````{tabs}
```{code-tab} python pandas
pdf[new_col] = pdf[numerator_col] / pdf[denominator_col]
```
```{code-tab} python Koalas
kdf[new_col] = kdf[numerator_col] / kdf[denominator_col]
```
```{code-tab} python PySpark
sdf = sdf.withColumn(new_col, F.col(numerator_col) / F.col(denominator_col))
```
````

## Preview data and structure

### Preview data

Display first n rows. PySpark and Koalas DataFrames are not ordered in the same way as pandas DFs, and so this may return different results unless explicitly ordered with `sdf.sort()`/`sdf.orderBy()` or `kdf.sort_values()`.

`tail()` does not exist in Spark 2.4.0, even on an ordered DF, and so cannot be used on PySpark or Koalas DFs.


````{tabs}
```{code-tab} python pandas - default 5
n = 5
pdf.head(n)
```

```{code-tab} python Koalas - default 5
n = 5
kdf.head(n)
```

```{code-tab} python PySpark - default 20
n = 5
sdf.show(n)
```
````

### Data types

Get data types

````{tabs}
```{code-tab} python pandas
types_pandas = pdf.dtypes

```
```{code-tab} python Koalas
types_koalas = kdf.dtypes

```
```{code-tab} python PySpark
types_spark = sdf.dtypes
```
````

### Row count

Get number of rows

````{tabs}
```{code-tab} python pandas
row_ct_pandas = pdf.shape[0]
```
```{code-tab} python Koalas
row_ct_koalas = kdf.shape[0]
```
```{code-tab} python PySpark
row_ct_spark = sdf.count()
```
````

### Column count

Get number of columns

````{tabs}
```{code-tab} python pandas
col_ct_pandas = pdf.shape[1]

```
```{code-tab} python Koalas
col_ct_koalas = kdf.shape[1]
```
```{code-tab} python PySpark
# This is not a function, so no () after columns
col_ct_spark = len(sdf.columns)
```
````

### Count distinct

Count distinct number of entries

````{tabs}
```{code-tab} python pandas: two possible methods
distinct_ct_pandas = len(pdf["animal_group"].unique())
distinct_ct_pandas = len(set(pdf["animal_group"]))

```
```{code-tab} python Koalas
distinct_ct_koalas = len(kdf["animal_group"].unique())

```
```{code-tab} python PySpark
distinct_ct_spark = sdf.select("animal_group").distinct().count()
```
````

## Filter rows

Be careful not to confuse the `.filter()` operations: in pandas, `.filter()` selects rows or columns by their index labels (columns by default), and a row index does not exist in Spark; in PySpark, `.filter()` selects rows by their values, the same as `.loc` with a boolean condition in pandas.
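
As a small illustration of the pandas side of this difference, the sketch below uses a made-up three-row DataFrame (`pdf_example` and the other names are hypothetical, not the animal rescue data). `.filter()` only ever looks at labels, whereas `.loc` with a condition looks at values.

```python
import pandas as pd

pdf_example = pd.DataFrame({
    "animal_group": ["Dog", "Cat", "Dog"],
    "cal_year": [2016, 2017, 2018],
})

# .filter() on axis=0 keeps the rows whose index labels are 0 and 2
by_row_label = pdf_example.filter(items=[0, 2], axis=0)

# .filter() without axis works on column labels: keeps the "animal_group" column
by_column_label = pdf_example.filter(items=["animal_group"])

# .loc with a condition keeps the rows where the value of animal_group is "Dog"
by_value = pdf_example.loc[pdf_example["animal_group"] == "Dog"]
```

Here the row-label and value selections happen to return the same two rows; that is a coincidence of the example data, not a property of `.filter()`.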

### Filter rows on values from one column

Filter rows on values from one column

````{tabs}
```{code-tab} python pandas
pdf_dogs = pdf.loc[pdf["animal_group"] == "Dog"]

```
```{code-tab} python Koalas
kdf_dogs = kdf.loc[kdf["animal_group"] == "Dog"]

```
```{code-tab} python PySpark
sdf_dogs = sdf.filter(F.col("animal_group") == "Dog")
```
````

### Filter rows on values from multiple columns

Filter rows on values from multiple columns; ensure you have each condition wrapped in brackets

````{tabs}
```{code-tab} python pandas
pdf_dogs_recent = pdf.loc[(pdf["animal_group"] == "Dog") & (pdf["cal_year"] >= 2017)]

```
```{code-tab} python Koalas
kdf_dogs_recent = kdf.loc[(kdf["animal_group"] == "Dog") & (kdf["cal_year"] >= 2017)]

```
```{code-tab} python PySpark
sdf_dogs_recent = sdf.filter((F.col("animal_group") == "Dog") & (F.col("cal_year") >= 2017))
```
````

## Handling missing values

### Filter nulls

Only return rows where specified column is null

```python
col_name = "job_hours"
```

````{tabs}
```{code-tab} python pandas
pdf_null_rows = pdf[pdf[col_name].isnull()]
```
```{code-tab} python Koalas
kdf_null_rows = kdf[kdf[col_name].isnull()]
```
```{code-tab} python PySpark
sdf_null_rows = sdf.filter(F.col(col_name).isNull())
```
````

### Filter non-nulls

Only return rows where specified column is not null

```python
col_name = "job_hours"
```

````{tabs}
```{code-tab} python pandas
pdf = pdf[~pdf[col_name].isnull()]
```
```{code-tab} python Koalas
kdf = kdf[~kdf[col_name].isnull()]
```
```{code-tab} python PySpark
sdf = sdf.filter(F.col(col_name).isNotNull())
```
````

### Fill nulls

Fill nulls with 0 (or other value)

```python
fill_value = 0
```

````{tabs}
```{code-tab} python pandas
pdf_nato0 = pdf.fillna(fill_value)
```
```{code-tab} python Koalas
kdf_nato0 = kdf.fillna(fill_value)
```
```{code-tab} python PySpark
sdf_nato0 = sdf.fillna(fill_value)
```
````

## Grouping and aggregating

### Sum a column

Returns sum of one column as a scalar value

```python
col_name = "total_cost"
```

````{tabs}
```{code-tab} python pandas
col_sum_pandas = pdf[col_name].sum()
```
```{code-tab} python Koalas
col_sum_koalas = kdf[col_name].sum()
```
```{code-tab} python PySpark
# sdf.agg() with one column returns a one-row DF; use .collect()[0][0] to get a scalar
col_sum_spark = sdf.agg(F.sum(col_name)).collect()[0][0]
```
````

### Get maximum of a column

Returns maximum of one column as a scalar value

```python
col_name = "total_cost"
```

````{tabs}
```{code-tab} python pandas
col_max_pandas = pdf[col_name].max()
```
```{code-tab} python Koalas
col_max_koalas = kdf[col_name].max()
```
```{code-tab} python PySpark
# sdf.agg() with one column returns a one-row DF; use .collect()[0][0] to get a scalar
col_max_spark = sdf.agg(F.max(col_name)).collect()[0][0]
```
````

### Get minimum of a column

Returns minimum of one column as a scalar value

```python
col_name = "total_cost"
```

````{tabs}
```{code-tab} python pandas
col_min_pandas = pdf[col_name].min()
```
```{code-tab} python Koalas
col_min_koalas = kdf[col_name].min()
```
```{code-tab} python PySpark
# sdf.agg() with one column returns a one-row DF; use .collect()[0][0] to get a scalar
col_min_spark = sdf.agg(F.min(col_name)).collect()[0][0]
```
````

### Multiple aggregations to every column

Multiple aggregations, applied to every non-grouped column

```python
group_col = "animal_group"
```

````{tabs}
```{code-tab} python pandas
pdf_agg = pdf.groupby([group_col]).agg(["min", "max"])
```
```{code-tab} python Koalas
kdf_agg = kdf.groupby([group_col]).agg(["min", "max"])
```
```{code-tab} python PySpark
# Build the list of non-grouped columns, then apply min and max to each
cols = sdf.columns
cols.remove(group_col)
sdf_agg = sdf.groupBy(group_col).agg(*[F.min(x) for x in cols], *[F.max(x) for x in cols])
```
````

### Multiple aggregations to specific columns

Multiple aggregations, applied to specific columns

```python
group_col = "animal_group"
num_col_1 = "total_cost"
num_col_2 = "job_hours"
```

````{tabs}
```{code-tab} python pandas
pdf_agg = pdf.groupby([group_col]).agg({num_col_1: ["sum"], num_col_2: ["max"]})
```
```{code-tab} python Koalas
kdf_agg = kdf.groupby([group_col]).agg({num_col_1: ["sum"], num_col_2: ["max"]})
```
```{code-tab} python PySpark
sdf_agg = sdf.groupBy(group_col).agg(F.sum(num_col_1), F.max(num_col_2))
```
````

### Multiple renamed aggregations

Multiple aggregations, specify aggregate column names

````{tabs}
```{code-tab} python pandas
group_col = "animal_group"
num_col_1 = "total_cost"
num_col_2 = "job_hours"
num_col_1_alias = "total_cost_sum"
num_col_2_alias = "job_hours_max"
# Unpack a dictionary so that the values of the alias variables become the column names
pdf_agg = pdf.groupby([group_col]).agg(**{num_col_1_alias: (num_col_1, "sum"), num_col_2_alias: (num_col_2, "max")})
```
```{code-tab} python Koalas
group_col = "animal_group"
num_col_1 = "total_cost"
num_col_2 = "job_hours"
num_col_1_alias = "total_cost_sum"
num_col_2_alias = "job_hours_max"
# Unpack a dictionary so that the values of the alias variables become the column names
kdf_agg = kdf.groupby([group_col]).agg(**{num_col_1_alias: (num_col_1, "sum"), num_col_2_alias: (num_col_2, "max")})
```
```{code-tab} python PySpark
group_col = "animal_group"
num_col_1 = "total_cost"
num_col_2 = "job_hours"
num_col_1_alias = "total_cost_sum"
num_col_2_alias = "job_hours_max"
sdf_agg = sdf.groupBy(group_col).agg(F.sum(num_col_1).alias(num_col_1_alias), F.max(num_col_2).alias(num_col_2_alias))
```
````
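
When naming aggregate columns in pandas, keyword argument names passed to `.agg()` are taken literally, so a column name held in a variable has to be supplied by unpacking a dictionary. The sketch below illustrates this on a made-up three-row DataFrame; `pdf_example`, `sum_alias` and `max_alias` are hypothetical names used only for this example.

```python
import pandas as pd

pdf_example = pd.DataFrame({
    "animal_group": ["Cat", "Cat", "Dog"],
    "total_cost": [100.0, 50.0, 200.0],
    "job_hours": [1.0, 2.0, 3.0],
})

sum_alias = "total_cost_sum"
max_alias = "job_hours_max"

# The keyword name is used as written, so this output column is called "sum_alias"
literal = pdf_example.groupby("animal_group").agg(sum_alias=("total_cost", "sum"))

# Unpacking a dictionary uses the strings stored in the variables as column names
named = pdf_example.groupby("animal_group").agg(
    **{sum_alias: ("total_cost", "sum"), max_alias: ("job_hours", "max")}
)

print(list(literal.columns))  # ['sum_alias']
print(list(named.columns))    # ['total_cost_sum', 'job_hours_max']
```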

### Pivot table

Create an Excel style pivot table

```python
index_col = ["animal_group", "origin_of_call"]
pivot_col = "cal_year"
value_col = ["total_cost"]
```

````{tabs}
```{code-tab} python pandas
# Two methods: on the DataFrame or using pd.pivot_table
pdf_pivot = pdf.pivot_table(index=index_col, columns=pivot_col, values=value_col, aggfunc="sum")
pdf_pivot = pd.pivot_table(pdf, index=index_col, columns=pivot_col, values=value_col, aggfunc="sum")
```
```{code-tab} python Koalas
# Note that ks.pivot_table does not exist
# In Koalas, aggfunc must be a string
kdf_pivot = kdf.pivot_table(index=index_col, columns=pivot_col, values=value_col, aggfunc="sum")
```
```{code-tab} python PySpark
# The list inputs have been unpacked with *
sdf_pivot = sdf.groupBy(*index_col).pivot(pivot_col).sum(*value_col)
# Alternative: unpacking is not needed if passing the column names directly
sdf_pivot = sdf.groupBy(index_col[0], index_col[1]).pivot(pivot_col).sum(value_col[0])
```
````

## Sorting

### Sorting by columns

Sort by specified columns. Note that in pandas and Koalas you can use `inplace=True` rather than having to reassign to the variable name.

```python
sort_col_1 = "total_cost"
sort_col_2 = "incident_number"
```

````{tabs}
```{code-tab} python pandas
pdf.sort_values([sort_col_1, sort_col_2], ascending=[False, True], inplace=True)
```
```{code-tab} python Koalas
kdf.sort_values([sort_col_1, sort_col_2], ascending=[False, True], inplace=True)
```
```{code-tab} python PySpark
# sort and orderBy are aliases; either line gives the same result
sdf = sdf.sort([sort_col_1, sort_col_2], ascending=[False, True])
sdf = sdf.orderBy([sort_col_1, sort_col_2], ascending=[False, True])
```
````

## Combining two DataFrames

### Join

Join two DataFrames; must be same DataFrame type, e.g. you can't join a PySpark DF and pandas DF without converting one of them first.

```python
rows = [["Cat", "Meow"],
        ["Dog", "Woof"],
        ["Cow", "Moo"]]

columns = ["animal_group", "animal_noise"]

join_col = "animal_group"
```

````{tabs}
```{code-tab} python pandas
pdf_desc = pd.DataFrame(rows, columns=columns)
pdf_joined = pdf.merge(pdf_desc, on=[join_col], how="left")
```
```{code-tab} python Koalas
kdf_desc = ks.DataFrame(rows, columns=columns)
kdf_joined = kdf.merge(kdf_desc, on=[join_col], how="left")
```
```{code-tab} python PySpark
sdf_desc = spark.createDataFrame(rows, columns)
sdf_joined = sdf.join(sdf_desc, on=[join_col], how="left")
```
````

### Append/Union All

Append a DataFrame to another. Ensure the schema (column names and types) is identical and in the same order in both DataFrames first.

```python
# Create another DF for pandas, Koalas and PySpark; note that the _dogs DFs were already created earlier
pdf_cats = pdf.loc[pdf["animal_group"] == "Cat"]
kdf_cats = kdf.loc[kdf["animal_group"] == "Cat"]
sdf_cats = sdf.filter(F.col("animal_group") == "Cat")
```

````{tabs}
```{code-tab} python pandas: two possible methods
# DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat works in all versions
pdf_dogs_and_cats = pdf_dogs.append(pdf_cats)
pdf_dogs_and_cats = pd.concat([pdf_dogs, pdf_cats])
```
```{code-tab} python Koalas: two possible methods
kdf_dogs_and_cats = kdf_dogs.append(kdf_cats)
kdf_dogs_and_cats = ks.concat([kdf_dogs, kdf_cats])
```
```{code-tab} python PySpark
# union keeps duplicates (equivalent to UNION ALL in SQL) and matches columns by position, not by name
sdf_dogs_and_cats = sdf_dogs.union(sdf_cats)
```
````

## Comparing DataFrames

### DataFrame element-wise equality

Compare two DataFrames element by element; the result holds `True` where the elements are identical and `False` otherwise.

This cannot be done in PySpark without a custom function. For whole rows consider `sdf_v2.exceptAll(sdf)`.

````{tabs}
```{code-tab} python pandas
pdf_v2 = pdf.copy()
pdf_equal = pdf_v2.eq(pdf)

```
```{code-tab} python Koalas
kdf_v2 = kdf
kdf_equal = kdf_v2.eq(kdf)
```
````

### Whole DataFrame equality

Compare equality of two DataFrames.

This cannot be done in PySpark; see the [Pytest for PySpark repository](http://np2rvlapxx507/DAP_CATS/Training/pytest-for-pyspark) for unit testing examples.

Note that in Koalas, `kdf_equal = kdf_v2.equals(kdf)` gives the same result as `kdf_v2.eq(kdf)`, that is, an element-wise comparison. Consider converting to PySpark and using the unit testing examples.

```python
pdf_v2 = pdf.copy()
```

````{tabs}
```{code-tab} python pandas
equality_pandas = pdf_v2.equals(pdf)
```
````
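
To make the difference between element-wise and whole DataFrame equality in pandas concrete, the sketch below compares a made-up two-row DataFrame with a copy of itself (`pdf_example` and the other names are hypothetical). The missing value is included deliberately, because the two methods treat it differently.

```python
import pandas as pd

pdf_example = pd.DataFrame({
    "job_hours": [1.0, float("nan")],
    "animal_group": ["Dog", "Cat"],
})
pdf_example_v2 = pdf_example.copy()

# Element-wise: returns a DataFrame of booleans; a missing value is never equal to another
elementwise = pdf_example_v2.eq(pdf_example)

# Whole DataFrame: returns a single boolean; missing values in the same position count as equal
whole = pdf_example_v2.equals(pdf_example)

print(elementwise["job_hours"].tolist())  # [True, False]
print(whole)                              # True
```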

## Custom functions

### Lambda functions/UDF

Apply custom functions.

In general try and avoid these and use the in-built functionality wherever possible as it will be far more efficient.

Lambda functions should work in Koalas, but did not at the time of writing.

```python
# Define function
def job_minutes(job_hours):
    return job_hours * 60
```

````{tabs}
```{code-tab} python pandas
pdf["job_minutes"] = pdf.apply(lambda x: job_minutes(x["job_hours"]), axis=1)
```
```{code-tab} python PySpark
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Register the function so that it can also be called from Spark SQL
spark.udf.register("job_minutes_udf_reg", job_minutes)
# Wrap the function as a UDF for use with the DataFrame API
job_minutes_udf = udf(job_minutes, DoubleType())

sdf = sdf.withColumn("job_minutes", job_minutes_udf(F.col("job_hours")))
```
````
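
For a calculation as simple as this one, the advice above to prefer in-built functionality means using ordinary column arithmetic instead. The sketch below shows this for pandas on a made-up DataFrame (`pdf_example`, `via_apply` and `via_builtin` are hypothetical names); in PySpark the equivalent expression would be `F.col("job_hours") * 60`.

```python
import pandas as pd

pdf_example = pd.DataFrame({"job_hours": [1.0, 2.5, 4.0]})


def job_minutes(job_hours):
    return job_hours * 60


# Custom function applied row by row, as in the pandas tab above
via_apply = pdf_example.apply(lambda x: job_minutes(x["job_hours"]), axis=1)

# In-built column arithmetic: same values, without calling Python once per row
via_builtin = pdf_example["job_hours"] * 60

print(via_apply.tolist())    # [60.0, 150.0, 240.0]
print(via_builtin.tolist())  # [60.0, 150.0, 240.0]
```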

## Other useful operations

### Rounding

Note that PySpark and pandas round numbers ending in `.5` differently; PySpark and Koalas will round away from zero whereas pandas rounds to the nearest even integer. This is explained in more detail in the [Tip: Rounding differences in Python, R and Spark](http://np2rvlapxx507/DAP_CATS/troubleshooting/tip-of-the-week/-/blob/master/tip_32_rounding.ipynb).

````{tabs}
```{code-tab} python pandas
# Rounds .5 to nearest even integer
pdf["incident_duration_round"] = pdf["incident_duration"].round()
```
```{code-tab} python Koalas
# Uses the Spark method of rounding (.5 is away from zero)
kdf["incident_duration_round"] = kdf["incident_duration"].round()
```
```{code-tab} python PySpark
# Default round for .5 is away from zero, F.bround is identical to pandas
sdf = sdf.withColumn("incident_duration_round", F.round("incident_duration"))
sdf = sdf.withColumn("incident_duration_round", F.bround("incident_duration"))
```
````
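
The two rounding rules can be compared side by side without pandas or Spark. The sketch below uses Python's standard `decimal` module purely as an illustration of the rules themselves; it is not how either library implements rounding, and `values`, `half_even` and `half_away` are hypothetical names.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

values = ["0.5", "1.5", "2.5", "-2.5"]

# Ties go to the nearest even integer: the rule described above for pandas and F.bround
half_even = [int(Decimal(v).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN)) for v in values]

# Ties go away from zero: the rule described above for F.round and Koalas
half_away = [int(Decimal(v).quantize(Decimal("1"), rounding=ROUND_HALF_UP)) for v in values]

print(half_even)  # [0, 2, 2, -2]
print(half_away)  # [1, 2, 3, -3]
```

The two rules only disagree when the value is exactly halfway between two integers, which is why the difference is easy to miss on real data.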

### Close Spark session

Disconnect from the Spark session. This is good practice to free up resources on the Spark cluster for other users.

````{tabs}
```{code-tab} python PySpark and Koalas
spark.stop()
```
````

### Further Resources

- [10 Minutes to Koalas](https://docs.databricks.com/languages/koalas.html): Introduction to Koalas from Databricks
- [Artifactory setup](http://np2rvlapxx507/DAP_CATS/guidance/-/blob/master/Artifactory.md)
- [Koalas setup instructions](http://np2rvlapxx507/DAP_CATS/guidance/-/blob/master/koalas_setup.md)
- [Pytest for PySpark repository](http://np2rvlapxx507/DAP_CATS/Training/pytest-for-pyspark)
- [QA of Code for Analysis and Research](https://best-practice-and-impact.github.io/qa-of-code-guidance/intro.html):
    - [Core Programming: Naming Variables](https://best-practice-and-impact.github.io/qa-of-code-guidance/core_programming.html#naming-variables)
- [Tip of the Week](http://np2rvlapxx507/DAP_CATS/troubleshooting/tip-of-the-week):
    - [Rounding differences in Python, R and Spark](http://np2rvlapxx507/DAP_CATS/troubleshooting/tip-of-the-week/-/blob/master/tip_32_rounding.ipynb)
    - [Pydoop](http://np2rvlapxx507/DAP_CATS/troubleshooting/tip-of-the-week/-/blob/master/tip_12_pydoop.ipynb)