# 4_2_Example_Data_Bank — **pandas API on Spark** version

This notebook demonstrates how to run a pandas-style workflow on top of Spark using the **pandas API on Spark** (`pyspark.pandas`), following the structure of the pandas API on Spark Quickstart.

It includes:
- A quick primer (object creation, conversions between pandas ↔ pandas-on-Spark ↔ Spark DataFrame).
- Ready-to-use I/O patterns (`ps.read_csv`, `psdf.to_parquet`, Spark I/O).
- Practical tips to **port your existing pandas notebook** with minimal code changes (mostly `import pandas as pd` → `import pyspark.pandas as ps`).

> If you already have a Spark cluster/session (Databricks, local Spark, EMR, etc.), run the cells below as-is.


In [None]:
import pandas as pd
import numpy as np
import pyspark.pandas as ps
from pyspark.sql import SparkSession

# Start or get Spark
spark = SparkSession.builder.appName("BankExample-pandas-on-Spark").getOrCreate()

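# Optional: tune options (Arrow + distributed index)
# Remember the current Arrow setting so it can be restored in the last cell.
prev_arrow = spark.conf.get("spark.sql.execution.arrow.pyspark.enabled", "false")
# "distributed" is cheaper to compute than "distributed-sequence" (the default in
# recent Spark versions), but its index values are neither consecutive nor deterministic.
ps.set_option("compute.default_index_type", "distributed")
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")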

print("Spark version:", spark.version)
print("Arrow enabled:", spark.conf.get("spark.sql.execution.arrow.pyspark.enabled"))


In [None]:
# --- Object Creation (from your Quickstart) ---

# pandas-on-Spark Series
s = ps.Series([1, 3, 5, np.nan, 6, 8])
display(s)

# pandas-on-Spark DataFrame
psdf = ps.DataFrame(
    {'a': [1, 2, 3, 4, 5, 6],
     'b': [100, 200, 300, 400, 500, 600],
     'c': ["one", "two", "three", "four", "five", "six"]},
    index=[10, 20, 30, 40, 50, 60]
)
display(psdf)

# pandas DataFrame → pandas-on-Spark
dates = pd.date_range('20130101', periods=6)
pdf = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
psdf2 = ps.from_pandas(pdf)
display(psdf2.head())

# pandas DataFrame → Spark DataFrame (the pandas index is not carried over)
sdf = spark.createDataFrame(pdf)
sdf.show()

# Spark DataFrame → pandas-on-Spark DataFrame
psdf3 = sdf.pandas_api()
display(psdf3.head())

# dtypes, index, columns, NumPy array (to_numpy() collects all rows to the driver: small data only)
display(psdf3.dtypes)
display(psdf3.index)
display(psdf3.columns)
display(psdf3.to_numpy())

# describe, transpose, sort
display(psdf3.describe())
display(psdf3.T)
display(psdf3.sort_index(ascending=False).head())
display(psdf3.sort_values(by='B').head())
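
One detail the conversions above do not show: `to_spark()` drops the pandas-on-Spark index unless an `index_col` is given, just as `spark.createDataFrame(pdf)` ignores the pandas index. Below is a minimal sketch of a round trip that keeps the index, assuming Spark 3.2+ where both `to_spark(index_col=...)` and `pandas_api(index_col=...)` accept that argument. The helper name `roundtrip_keep_index`, the `transform` hook, and the column name `"idx"` are hypothetical, not part of the Quickstart.

```python
def roundtrip_keep_index(psdf, transform=None, index_col="idx"):
    """Go pandas-on-Spark -> Spark -> pandas-on-Spark without losing the index.

    `transform` is an optional function applied to the Spark DataFrame in
    between (a hypothetical hook for your own Spark-side logic).
    """
    # Materialize the index as a regular Spark column named `index_col`.
    sdf = psdf.to_spark(index_col=index_col)
    if transform is not None:
        sdf = transform(sdf)
    # Ask pandas-on-Spark to use that column as the index again.
    return sdf.pandas_api(index_col=index_col)
```

With the `psdf` created earlier in this cell, a call could look like `roundtrip_keep_index(psdf, lambda sdf: sdf.filter("a > 2"))`. The index values 10 to 60 should then survive the Spark-side filter instead of being replaced by a fresh default index.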


In [None]:
# --- Missing Data ---
pdf1 = pdf.reindex(index=dates[0:4], columns=list(pdf.columns) + ['E'])
pdf1.loc[dates[0]:dates[1], 'E'] = 1
psdf1 = ps.from_pandas(pdf1)
display(psdf1)

display(psdf1.dropna(how='any'))
display(psdf1.fillna(value=5))


In [None]:
# --- Grouping ---
psdf_g = ps.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                       'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                       'C': np.random.randn(8),
                       'D': np.random.randn(8)})

display(psdf_g)
display(psdf_g.groupby('A').sum())
display(psdf_g.groupby(['A', 'B']).sum())


In [None]:
# --- Plotting ---
# For large data, consider sampling before plotting.
# pandas-on-Spark uses the plotly backend by default, so .plot() returns a
# plotly Figure (not a matplotlib Axes); set the title via update_layout().
pser = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
psser = ps.Series(pser).cummax()
fig = psser.plot()
fig.update_layout(title_text="Series cummax()")
fig.show()

pdf_plot = pd.DataFrame(np.random.randn(1000, 4), index=pser.index, columns=['A', 'B', 'C', 'D'])
psdf_plot = ps.from_pandas(pdf_plot).cummax()
fig2 = psdf_plot.plot()
fig2.update_layout(title_text="DataFrame cummax()")
fig2.show()


In [None]:
# --- Getting data in/out ---

# CSV (read/write)
# Replace 'path/to/file.csv' with your file path (local, DBFS, s3://, abfss://, etc.)
# Read
# ps.read_csv supports common pandas-like options.
# bank_psdf = ps.read_csv('path/to/file.csv')
# Write (Spark writes a directory named 'foo.csv' containing part files, not a single file)
# bank_psdf.to_csv('foo.csv')

# Parquet (recommended for speed + schema)
# bank_psdf.to_parquet('bar.parquet')
# ps.read_parquet('bar.parquet')

# Spark I/O interop (ORC, JDBC, etc.)
# bank_psdf.spark.to_spark_io('zoo.orc', format="orc")
# ps.read_spark_io('zoo.orc', format="orc")

# Spark DataFrame interop at any time:
# sdf_bank = bank_psdf.to_spark()          # pandas-on-Spark -> Spark DF
# bank_psdf2 = sdf_bank.pandas_api()       # Spark DF -> pandas-on-Spark


In [None]:
# --- Porting your existing pandas notebook ---
# 1) Adjust imports:
#    import pandas as pd            # keep if you still use pandas locally
#    import pyspark.pandas as ps    # new: use ps wherever you ran pandas ops on big data
#
# 2) Use ps.read_csv / ps.read_parquet instead of pd.read_csv/pd.read_parquet for big data.
#
# 3) Many pandas ops work the same: .head(), .describe(), .groupby(), .fillna(), .dropna(), .merge(), .sort_values(), etc.
#
# 4) For plotting or ops requiring in-memory arrays, use small samples, or convert via .to_pandas() on small/filtered data:
#    psdf.sample(frac=0.05, random_state=42).to_pandas().plot(...)
#
# 5) If you already have a Spark DataFrame 'df' from previous steps, instantly bridge to pandas-on-Spark:
#    psdf = df.pandas_api()
#
# 6) Ordered head (preserve natural order) adds sorting overhead. Enable only if needed:
#    ps.set_option('compute.ordered_head', True)   # default False
#
# 7) Arrow acceleration is already enabled above. You can toggle if needed:
#    spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
#
# 8) When something isn't supported in pandas-on-Spark, drop down to Spark DataFrame APIs:
#    sdf = psdf.to_spark()
#    # do Spark transformations...
#    psdf = sdf.pandas_api()
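
Tip 4 above uses a fixed `frac=0.05`, which can still be too much for a very large table. The following sketch derives the fraction from a row budget instead. The helper names `sample_fraction` and `to_small_pandas` and the default of 10,000 rows are assumptions for illustration; only `len()`, `.sample(frac=..., random_state=...)` and `.to_pandas()` are pandas-on-Spark calls.

```python
def sample_fraction(n_rows, max_rows=10_000):
    """Fraction to pass to .sample(frac=...) so that roughly max_rows remain."""
    if n_rows <= 0:
        return 1.0
    return min(1.0, max_rows / n_rows)


def to_small_pandas(psdf, max_rows=10_000, seed=42):
    """Collect roughly `max_rows` rows of a pandas-on-Spark DataFrame to pandas."""
    # len() on a pandas-on-Spark DataFrame triggers a Spark count job.
    frac = sample_fraction(len(psdf), max_rows)
    if frac < 1.0:
        psdf = psdf.sample(frac=frac, random_state=seed)
    # Spark sampling is approximate, so the result may slightly exceed max_rows.
    return psdf.to_pandas()
```

For example, `to_small_pandas(bank_psdf, max_rows=5_000).plot(...)` would plot with plain pandas on a bounded sample, where `bank_psdf` is the DataFrame from the I/O cell.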


In [None]:
# --- (Optional) Reset options at the end ---
# ps.reset_option("compute.default_index_type")
# spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", prev_arrow)
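
Saving `prev_arrow` at the top and restoring it here works, but the restore is skipped if a cell fails or the notebook is interrupted. A context manager is one way to make the restore automatic. The sketch below is hypothetical (the name `temporary_conf` is not a Spark API) and assumes the object passed in offers `get(key, default)`, `set(key, value)` and `unset(key)`, as `spark.conf` does.

```python
from contextlib import contextmanager


@contextmanager
def temporary_conf(conf, key, value):
    """Set `key` to `value` inside the `with` block, then restore the old state.

    `conf` is expected to behave like `spark.conf` (get/set/unset methods).
    """
    # None as the default means "this key was not explicitly set before".
    previous = conf.get(key, None)
    conf.set(key, value)
    try:
        yield conf
    finally:
        # Restore even if the body raised an exception.
        if previous is None:
            conf.unset(key)
        else:
            conf.set(key, previous)
```

Usage would be `with temporary_conf(spark.conf, "spark.sql.execution.arrow.pyspark.enabled", "true"): ...`. For pandas-on-Spark options such as `compute.default_index_type`, the built-in `ps.option_context(...)` serves the same purpose.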
