# Introductory notebook for Pandas on Spark
Pandas on Spark bridges the gap between pandas' ease of use and Spark's scalability.  
It is particularly useful for those who are already proficient with Pandas and want to use Spark to scale out.   
Run this notebook in local mode (not attached to any cluster).

See also [Documentation on Pandas on Spark](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html)

Contact: Luca.Canali@cern.ch

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PandasOnSpark").getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/12/21 11:39:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Creating DataFrames

In [3]:
# Create a pandas-on-Spark DataFrame from an existing pandas DataFrame:

import pandas as pd
import pyspark.pandas as ps

# Creates a Pandas DataFrame
pandas_df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Converts the Pandas DataFrame into a Pandas-on-Spark DataFrame
spark_ps = ps.DataFrame(pandas_df)

In [4]:
# Create Pandas-on-Spark DataFrames directly

spark_ps1 = ps.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

spark_ps2 = ps.DataFrame(range(10))

## Viewing Data

In [5]:
# Show first n rows: 

spark_ps.head(3)

   A  B
0  1  4
1  2  5
2  3  6


## Convert

In [6]:
# Convert a pandas-on-Spark DataFrame to a pandas DataFrame
pandas_df = spark_ps.to_pandas()

pandas_df




   A  B
0  1  4
1  2  5
2  3  6


## Data subsetting

In [7]:
# Selecting values from a named column

spark_ps['A']

0    1
1    2
2    3
Name: A, dtype: int64

In [8]:
# Slicing

spark_ps[1:3]

   A  B
1  2  5
2  3  6


In [10]:
# Filtering 

spark_ps[spark_ps['A'] < 2]

   A  B
0  1  4


## Applying functions

In [11]:
# Apply a function to the data

spark_ps['A'].apply(lambda x: x * 2)

0    2
1    4
2    6
Name: A, dtype: int64

## Grouping Data

In [12]:
# Group by column 'A'

spark_ps.groupby('A').sum()

   B
A
1  4
3  6
2  5


## Handling Missing Data

In [13]:
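# Drop rows that contain missing values:

spark_ps.dropna()

# Replace missing values with a fill value.
# The sample data has no missing values, so the output is unchanged:

spark_ps.fillna('null')
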

   A  B
0  1  4
1  2  5
2  3  6


## Joining and Merging

In [15]:
# Merge / join two pandas-on-Spark DataFrames

spark_ps1 = ps.DataFrame({'A1': [1, 2, 3], 'B': [0, 0, 1]})
spark_ps2 = ps.DataFrame({'A2': [4, 5, 6], 'B': [0, 1, 1]})


ps.merge(spark_ps1, spark_ps2, on='B', how='inner')


   A1  B  A2
0   1  0   4
1   2  0   4
2   3  1   5
3   3  1   6


In [16]:
spark.stop()