<a href="https://colab.research.google.com/github/orcascope/spark_play/blob/main/setup_spark_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd /content/drive/MyDrive/Github/

/content/drive/MyDrive/Github


In [14]:
!rm -rf .git
!git clone https://github.com/orcascope/spark_play

Cloning into 'spark_play'...
remote: Enumerating objects: 15, done.
remote: Counting objects: 100% (15/15), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 15 (delta 2), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (15/15), 1.17 MiB | 4.35 MiB/s, done.
Resolving deltas: 100% (2/2), done.


In [15]:
%cd spark_play

/content/drive/MyDrive/Github/spark_play


In [16]:
!apt-get -qq update > /tmp/apt.out
!apt-get install -y -qq openjdk-11-jdk-headless

In [17]:
!(wget -q --show-progress -nc https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz)
!tar xf spark-3.2.1-bin-hadoop3.2.tgz



In [18]:
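try:
  import pyspark, findspark, delta, pyngrok
except ImportError:
  # Install only when an import fails. PySpark is pinned to the Spark build downloaded above.
  %pip install -q --upgrade pyspark==3.2.1
  %pip install -q findspark
  # Delta Lake's Python package on PyPI is "delta-spark"; the 2.0.x line targets Spark 3.2.
  %pip install -q delta-spark==2.0.0
  %pip install -q pyngrok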

Pass the configuration key-value pairs to the session builder to get a SparkSession object.

In [19]:
import findspark
import pyspark
import os

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/drive/MyDrive/Github/spark-3.2.1-bin-hadoop3.2"

findspark.init()
MAX_MEMORY = "8g"
maven_coords = [
    "org.apache.spark:spark-avro_2.12:3.2.1",
    "io.delta:delta-core_2.12:2.0.0rc1",
    "org.xerial:sqlite-jdbc:3.36.0.3",
    "graphframes:graphframes:0.8.2-spark3.2-s_2.12",
    "com.acervera.osm4scala:osm4scala-spark3-shaded_2.12:1.0.8",
]
spark = (pyspark.sql.SparkSession.builder.appName("MyApp")
    .config("spark.jars.packages", ",".join(maven_coords))
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.executor.memory", MAX_MEMORY)
    .config("spark.driver.memory", MAX_MEMORY)
    .config("spark.ui.port", "4040")  # must match the port that the ngrok cell below tunnels to
    .enableHiveSupport()
    .getOrCreate()
    )
spark
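
The key-value pattern above can also be kept in a plain dict and applied in a loop, which makes the settings easier to inspect or reuse. This is a minimal sketch, not the notebook's own method: `build_spark_conf`, `apply_conf`, and `get_session` are hypothetical helper names, and only `builder.config(key, value)`, `appName`, and `getOrCreate` are PySpark calls.

```python
from typing import Dict, Iterable


def build_spark_conf(maven_coords: Iterable[str], memory: str = "8g",
                     ui_port: int = 4040) -> Dict[str, str]:
    """Collect the session settings in one dict so they are easy to inspect."""
    return {
        "spark.jars.packages": ",".join(maven_coords),
        "spark.executor.memory": memory,
        "spark.driver.memory": memory,
        "spark.ui.port": str(ui_port),
    }


def apply_conf(builder, conf: Dict[str, str]):
    """Call builder.config(key, value) for every pair and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


def get_session(conf: Dict[str, str], app_name: str = "MyApp"):
    """Create (or reuse) a SparkSession; pyspark is imported lazily here."""
    from pyspark.sql import SparkSession
    return apply_conf(SparkSession.builder.appName(app_name), conf).getOrCreate()
```

In the notebook this would be called as `get_session(build_spark_conf(maven_coords))`. The Delta-specific settings from the cell above would be added to the dict in the same way.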

In [29]:
from pyngrok import ngrok, conf
import getpass

print("Enter your authtoken, which can be copied "
"from https://dashboard.ngrok.com/get-started/your-authtoken")
conf.get_default().auth_token = getpass.getpass()

ui_port = 4040
public_url = ngrok.connect(ui_port).public_url
print(f" * ngrok tunnel \"{public_url}\" -> \"http://127.0.0.1:{ui_port}\"")

Enter your authtoken, which can be copied from https://dashboard.ngrok.com/get-started/your-authtoken
··········




 * ngrok tunnel "https://50f0-34-68-94-133.ngrok-free.app" -> "http://127.0.0.1:4040"


Setup is complete. At this point you have started a Spark application and can access the application UI through the URL above.
You can start writing your data transformation code below.
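
If the tunnel URL shows an error page, one thing worth checking is whether anything is listening on the local UI port at all. The sketch below uses only the standard library; `port_is_open` is a hypothetical helper name and is not part of pyngrok or Spark.

```python
import socket


def port_is_open(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0
```

Calling `port_is_open(ui_port)` before `ngrok.connect(ui_port)` gives an early hint when the Spark UI is running on a different port than the one being tunnelled.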

# Spark SQL API: Create a temporary view from the CSV data source in spark_play/netflix_titles.csv.

Use spark.sql("query") to query the view with SQL syntax.

In [26]:
spark.read.format("csv").option("header", "true").load('./spark_play/netflix_titles.csv').createOrReplaceTempView("movies")
spark.sql("select * from movies limit 5").show()

+-------+-------+-----+-----------------+--------------------+-------------+-----------------+------------+------+---------+--------------------+--------------------+
|show_id|   type|title|         director|                cast|      country|       date_added|release_year|rating| duration|           listed_in|         description|
+-------+-------+-----+-----------------+--------------------+-------------+-----------------+------------+------+---------+--------------------+--------------------+
|     s1|TV Show|   3%|             null|João Miguel, Bian...|       Brazil|  August 14, 2020|        2020| TV-MA|4 Seasons|International TV ...|In a future where...|
|     s2|  Movie| 7:19|Jorge Michel Grau|Demián Bichir, Hé...|       Mexico|December 23, 2016|        2016| TV-MA|   93 min|Dramas, Internati...|After a devastati...|
|     s3|  Movie|23:59|     Gilbert Chan|Tedd Chan, Stella...|    Singapore|December 20, 2018|        2011|     R|   78 min|Horror Movies, In...|When an army recr...
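
Once the view exists, any SQL that Spark supports can be run against it. As a small sketch, the function below counts titles per `type`, a column taken from the output above. `titles_per_type` is a hypothetical helper name, and the view name defaults to the `movies` view created in the cell.

```python
def titles_per_type(spark, view: str = "movies"):
    """Run a GROUP BY over the temp view and return the resulting DataFrame."""
    query = f"""
        select type, count(*) as n_titles
        from {view}
        group by type
        order by n_titles desc
    """
    return spark.sql(query)
```

In the notebook, `titles_per_type(spark).show()` would print one row per distinct value of `type`.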

# PySpark DataFrame API: Create a DataFrame from the CSV data source

Use PySpark DataFrame methods to transform the data.

In [27]:
movies_df = spark.read.format("csv").option("header", "true").load('./spark_play/netflix_titles.csv')
movies_df.show()

+-------+-------+------+--------------------+--------------------+--------------------+-----------------+------------+------+---------+--------------------+--------------------+
|show_id|   type| title|            director|                cast|             country|       date_added|release_year|rating| duration|           listed_in|         description|
+-------+-------+------+--------------------+--------------------+--------------------+-----------------+------------+------+---------+--------------------+--------------------+
|     s1|TV Show|    3%|                null|João Miguel, Bian...|              Brazil|  August 14, 2020|        2020| TV-MA|4 Seasons|International TV ...|In a future where...|
|     s2|  Movie|  7:19|   Jorge Michel Grau|Demián Bichir, Hé...|              Mexico|December 23, 2016|        2016| TV-MA|   93 min|Dramas, Internati...|After a devastati...|
|     s3|  Movie| 23:59|        Gilbert Chan|Tedd Chan, Stella...|           Singapore|December 20, 2018|     
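
A transformation with the DataFrame API might look like the sketch below, which keeps only rows whose `type` is `Movie` and counts them per `release_year`. `movies_per_year` is a hypothetical helper name. Because the CSV is read without schema inference, `release_year` is a string column here, so the ordering is lexicographic; that matches numeric order only when every value is a four-digit year.

```python
def movies_per_year(df):
    """Count rows whose type is 'Movie' for each release_year, newest first."""
    return (
        df.filter("type = 'Movie'")        # SQL-style predicate on the type column
          .groupBy("release_year")
          .count()                          # adds a column named "count"
          .orderBy("release_year", ascending=False)
    )
```

In the notebook this would be used as `movies_per_year(movies_df).show(10)`.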

Write a DataFrame to a Delta table.

In [None]:
#from delta.tables import DeltaTable
import delta

df = spark.createDataFrame([{'s':'hello world','i':1234}])

(df.write.format('delta')
         .mode('overwrite')
         .option("mergeSchema", "true")
         .save('./delta_hello_world')
)


Read the data back from the Delta table. Here the Spark SQL API is used for the read.

In [None]:
spark.read.format("delta").load('./delta_hello_world').createOrReplaceTempView("delta_hello_world")
df2 = spark.sql("""
  select * from delta_hello_world
""")
df2.show()

+----+-----------+
|   i|          s|
+----+-----------+
|1234|hello world|
+----+-----------+
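
Because a Delta table keeps a versioned transaction log, an earlier snapshot can also be read back with the `versionAsOf` reader option. The sketch below assumes the table has at least the requested version; `read_delta_version` is a hypothetical helper name.

```python
def read_delta_version(spark, path: str, version: int = 0):
    """Load a specific version of the Delta table stored at path."""
    return (
        spark.read.format("delta")
             .option("versionAsOf", version)  # Delta time-travel option
             .load(path)
    )
```

In the notebook, `read_delta_version(spark, './delta_hello_world', 0).show()` would display the table as of its first write.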

