# Setup

This is a quick piece of code showing how to fire up [Apache Spark](https://spark.apache.org/) and then [Koalas](https://koalas.readthedocs.io/en/latest/index.html) on Google Colab. Koalas "makes data scientists more productive when interacting with big data, by implementing the pandas DataFrame API on top of Apache Spark" (in the words of its creators).

First, we install the required packages and set environment variables:

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark
!pip install koalas

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"
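Before moving on, it can help to confirm that both variables point at directories that actually exist, since a failed download or a different Java path would otherwise only surface later as an import or JVM error. The following is a minimal sketch; the helper name `check_spark_env` is hypothetical and not part of Spark, findspark, or Koalas.

```python
import os
from pathlib import Path


def check_spark_env(env=None):
    """Report whether JAVA_HOME and SPARK_HOME point to existing directories.

    `check_spark_env` is an illustrative helper, not a library function.
    """
    env = os.environ if env is None else env
    status = {}
    for name in ("JAVA_HOME", "SPARK_HOME"):
        value = env.get(name)
        # An unset or empty variable counts as not ready.
        status[name] = bool(value) and Path(value).is_dir()
    return status


print(check_spark_env())
```

If either entry is `False`, re-running the install cell or adjusting the path is probably the first thing to try.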

# Initializing

Next, we start a Spark session. This step is needed in order to interact with the Spark API:

In [4]:
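import findspark
findspark.init()  # adds the Spark install at SPARK_HOME to sys.path so pyspark can be imported

from pyspark.sql.session import SparkSession

spark = SparkSession.builder\
    .master('local[*]')\
    .getOrCreate()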

In [10]:
import pandas as pd
import databricks.koalas as ks

Now, we show an example of how to move data between the different frameworks:

In [23]:
# A simple pandas DataFrame
df = pd.DataFrame({
    'numbers': [1,2,3],
    'strings': ['a','b','c']
})
type(df)

pandas.core.frame.DataFrame

In [18]:
# A Spark DataFrame, now ready for big data analysis
sdf = spark.createDataFrame(df)
type(sdf)

pyspark.sql.dataframe.DataFrame

In [24]:
# A Koalas DataFrame! Now you can use most pandas idioms to work with big data.
kdf = sdf.to_koalas()
type(kdf)

databricks.koalas.frame.DataFrame
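To illustrate what "most pandas idioms" means in practice, here is a small sketch of a function that uses only column selection, boolean filtering, and `sum()`. It is run on the pandas `df` so that it works without a Spark session; the assumption is that passing `kdf` instead would behave the same way, since Koalas aims to mirror these parts of the pandas API. The function name `summarize` is hypothetical.

```python
import pandas as pd


def summarize(frame):
    """Total the 'numbers' column and keep the rows where it exceeds 1.

    Written against the pandas API only; assumed (not verified here)
    to work unchanged on a Koalas DataFrame.
    """
    total = frame["numbers"].sum()
    filtered = frame[frame["numbers"] > 1]
    return total, filtered


df = pd.DataFrame({
    "numbers": [1, 2, 3],
    "strings": ["a", "b", "c"],
})

total, filtered = summarize(df)
print(total)     # 6
print(filtered)  # the rows with numbers 2 and 3
```

Writing analysis code against this shared subset is one way to prototype on a small pandas sample and then switch to Koalas for the full dataset.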