# DataFrames 1

Working with Pandas and Spark dataframes

## Step 1 - Initialize Spark

In [None]:
try:
    spark
except NameError:
    import findspark
    findspark.init()  # uses SPARK_HOME
    print("Spark found in : ", findspark.find())

    import pyspark
    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    # use a unique temp dir as the warehouse dir, so we can run multiple Spark sessions in one directory
    import tempfile
    tmpdir = tempfile.TemporaryDirectory()

    config = ( SparkConf()
             .setAppName("TestApp")
             .setMaster("local[*]")
             .set('spark.executor.memory', '2g')  # Spark config keys need the 'spark.' prefix
             .set('spark.sql.warehouse.dir', tmpdir.name)
             .set("some_property", "some_value") # another example
             )

    spark = SparkSession.builder.config(conf=config).getOrCreate()

print('Spark UI running on port ' + spark.sparkContext.uiWebUrl.split(':')[2])
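
The `split(':')[2]` expression above assumes the UI URL always looks like `http://host:port`. As a slightly more defensive sketch, the standard library's `urllib.parse` can extract the port instead. The helper name `ui_port` is hypothetical (it is not part of PySpark), and the `None` check is there on the assumption that `uiWebUrl` may be empty, for example when the Spark UI is disabled.

```python
from urllib.parse import urlparse


def ui_port(ui_url):
    """Return the port of a URL such as 'http://host:4040', or None if no URL is given.

    Hypothetical helper for illustration; not part of PySpark.
    """
    if not ui_url:
        return None
    return urlparse(ui_url).port


# In this notebook the call would be: ui_port(spark.sparkContext.uiWebUrl)
# A literal URL is used here so the snippet runs without a Spark session.
print('Spark UI running on port', ui_port('http://localhost:4040'))
```

Unlike splitting on `:`, this also copes with IPv6 hosts such as `http://[::1]:4040`.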

## Step 2 - Create a Pandas DataFrame

Here, we will create a Pandas DF and then convert it to Spark.

In [None]:
import pandas as pd

pd_df = pd.DataFrame({'col1': ['A', 'B', 'C', 'D'],
                      'col2': [10, 20, 30, 40],
                      'col3': [1.1, 2.2, 3.3, 4.4]})
pd_df
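
Before converting, it can help to look at the Pandas dtypes, because Spark infers its schema from them. As a rough guide (assuming default settings), text columns typically become `string`, 64-bit integers become `long`, and 64-bit floats become `double`. The sketch below rebuilds the same DataFrame so it runs on its own and prints the dtypes Pandas picked.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# Same data as the cell above, repeated so this snippet is self-contained
pd_df = pd.DataFrame({'col1': ['A', 'B', 'C', 'D'],
                      'col2': [10, 20, 30, 40],
                      'col3': [1.1, 2.2, 3.3, 4.4]})

# One dtype per column; the exact name shown for the text column
# depends on the Pandas version
print(pd_df.dtypes)

# Programmatic checks that do not depend on the dtype's printed name
print('col2 is integer:', is_integer_dtype(pd_df['col2']))
print('col3 is float:  ', is_float_dtype(pd_df['col3']))
```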

## Step 3 - Convert it to Spark DF

In [None]:
spark_df = spark.createDataFrame(pd_df)
spark_df.printSchema()
spark_df.show()

## Step 4 - Converting from Spark --> Pandas

In [None]:
# describe() returns a small Spark DF of summary statistics (count, mean, stddev, min, max)
summary = spark_df.describe()

summary_pd = summary.toPandas()

summary_pd

### Spark DataFrame <--> Pandas DataFrame Conversion

You can easily convert a Spark DF to a Pandas DF with the `toPandas()` method.

But avoid converting large Spark DFs to Pandas. `toPandas()` collects every row into the memory of the driver process, so a large DF can make it run out of memory.

![](../assets/images/spark-pandas-dataframe-conversion-1.png)

Here `describe` gives us a summary DF, which is quite small, so we can convert it to Pandas comfortably.

If you are converting arbitrary Spark DFs into Pandas DFs, cap the number of rows first:

`spark_df.limit(10).toPandas()`

This way only the limited rows are collected, however large the source DF is.

In [None]:
spark.range(1,10000000).limit(10).toPandas()
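
The `limit(...)` guard above silently truncates the data. If you would rather be told when a DF is too big, one possible sketch is a small wrapper that fetches one row more than the cap and raises an error if that extra row exists. The name `to_pandas_capped` and the default of 1000 rows are assumptions made for illustration; the function only relies on the `limit()` and `toPandas()` methods already used in this notebook.

```python
def to_pandas_capped(sdf, max_rows=1000):
    """Convert a Spark DF to Pandas, refusing to collect more than max_rows rows.

    Hypothetical helper for illustration; not part of PySpark.
    """
    # Ask for one extra row so "exactly max_rows" can be told apart
    # from "more than max_rows" without counting the whole DF
    pdf = sdf.limit(max_rows + 1).toPandas()
    if len(pdf) > max_rows:
        raise ValueError(
            f"DataFrame has more than {max_rows} rows; "
            "filter, aggregate, or limit it before converting to Pandas"
        )
    return pdf
```

In this notebook, `to_pandas_capped(spark_df, max_rows=10)` would return all four rows, while `to_pandas_capped(spark.range(1, 10000000), max_rows=10)` would raise a `ValueError` after collecting only 11 rows.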