# Introductory notebook for Pandas on Spark
The pandas API on Spark (`pyspark.pandas`) bridges the gap between the ease of use of pandas and the scalability of Spark.  
It is particularly useful for users who are already proficient with pandas and want to scale out their workloads with Spark.

See also [Documentation on Pandas on Spark](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html)

Contact: Luca.Canali@cern.ch

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PandasOnSpark").getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/12/01 16:05:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Creating DataFrames

In [2]:
# Create a pandas-on-Spark DataFrame from an existing pandas DataFrame:

import pandas as pd
import pyspark.pandas as ps

pandas_df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
spark_df = ps.DataFrame(pandas_df)



In [3]:
# Or create it from scratch

spark_df = ps.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})


## Viewing Data

In [4]:
# Show the first n rows (here n = 3):

spark_df.head(3)


                                                                                

   A  B
0  1  4
1  2  5
2  3  6


## Data subsetting

In [5]:
# Selecting values from a named column

spark_df['A']

0    1
1    2
2    3
Name: A, dtype: int64

In [6]:
# Slicing

spark_df[1:3]

   A  B
1  2  5
2  3  6


In [7]:
# Filtering 

spark_df[spark_df['A'] < 2]

   A  B
0  1  4


## Applying functions

In [8]:
# Apply a function to the data

spark_df['A'].apply(lambda x: x * 2)




0    2
1    4
2    6
Name: A, dtype: int64

## Grouping Data

In [9]:
# Sum column B within each group of A

spark_df.groupby('A').sum()



   B
A
1  4
3  6
2  5
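
In the cell above every value of `A` is unique, so the grouped sum simply returns `B` unchanged. The following is a minimal sketch with a repeated key, written against plain pandas (already imported in this notebook) so that it runs without a Spark session. The data is hypothetical, and the assumption is that `pyspark.pandas` accepts the same `groupby(...).sum().sort_index()` chain. The `sort_index()` call is included because, as the output above shows, row order is not guaranteed on Spark.

```python
import pandas as pd

# Hypothetical data: key 1 appears twice in column A
df = pd.DataFrame({'A': [1, 1, 2], 'B': [4, 5, 6]})

# Sum B within each group of A, then sort by the group key
# so that the row order is deterministic
totals = df.groupby('A').sum().sort_index()

print(totals)
```

Here the two rows with `A == 1` are combined into a single row whose `B` value is their sum.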


## Handling Missing Data

In [10]:
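# Drop rows that contain missing values (returns a new DataFrame):

spark_df.dropna()

# Fill missing values with a placeholder (returns a new DataFrame).
# The example data has no missing values, so the output is unchanged:
spark_df.fillna('null')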
        

   A  B
0  1  4
1  2  5
2  3  6
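
Because the example DataFrame contains no missing values, the output above does not show what `dropna` and `fillna` actually do. The sketch below uses hypothetical data with gaps. It is written against plain pandas so that it runs without a Spark session; the assumption is that replacing the import with `pyspark.pandas` gives the same result for these two calls.

```python
import pandas as pd

# Hypothetical data with one missing value in each column
df = pd.DataFrame({'A': [1.0, None, 3.0], 'B': [4.0, 5.0, None]})

# dropna() keeps only the rows that have no missing values
dropped = df.dropna()

# fillna() replaces every missing value with the given constant
filled = df.fillna(0)

# Both calls return new DataFrames; df itself is left untouched
print(dropped)
print(filled)
```

Only the first row survives `dropna()`, since the other two rows each contain a missing value.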


## Joining and Merging

In [11]:
# Inner join of two DataFrames on the shared column B

spark_df1 = ps.DataFrame({'A1': [1, 2, 3], 'B': [0, 0, 1]})
spark_df2 = ps.DataFrame({'A2': [4, 5, 6], 'B': [0, 1, 1]})

ps.merge(spark_df1, spark_df2, on='B', how='inner')


   A1  B  A2
0   1  0   4
1   2  0   4
2   3  1   5
3   3  1   6
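
The result has four rows even though each input has only three, because the key `B` is repeated on both sides. As an illustration only (this is not how Spark executes the join), the matching logic of an inner join can be sketched in plain Python on the same values:

```python
# Rows of the two DataFrames above, written as tuples
left = [(1, 0), (2, 0), (3, 1)]    # (A1, B)
right = [(4, 0), (5, 1), (6, 1)]   # (A2, B)

# Inner join: pair every left row with every right row that has the same B
joined = [(a1, b, a2)
          for a1, b in left
          for a2, b_right in right
          if b == b_right]

for row in joined:
    print(row)   # each tuple is (A1, B, A2)
```

Key `0` matches two left rows against one right row, and key `1` matches one left row against two right rows, which gives 2 + 2 = 4 rows.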


## Converting to pandas DataFrame

In [12]:
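# to_pandas() collects all the data to the driver,
# so use it only when the result fits in driver memory
pandas_df = spark_df.to_pandas()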

pandas_df



   A  B
0  1  4
1  2  5
2  3  6


In [13]:
spark.stop()