# Spark

This page covers PySpark, the Python API for Apache Spark.

In [1]:
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName('Temp').getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/09/19 10:34:05 WARN Utils: Your hostname, user-ThinkPad-E16-Gen-2, resolves to a loopback address: 127.0.1.1; using 10.202.22.210 instead (on interface enp0s31f6)
25/09/19 10:34:05 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/09/19 10:34:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/09/19 10:34:06 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


## DataFrame

Spark SQL provides the `DataFrame` object for working with tabular data.

You can create a DataFrame in two ways:

- Directly from your code, using the `createDataFrame` method of the session object.
- From an external source, using the reader methods available through the `read` attribute of the session.

---

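The following cell creates a DataFrame from a list of tuples, where each tuple is a row and each element of the tuple is the value for one column, and then displays it. Because no column names are given, Spark assigns the default names `_1` and `_2`.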

In [14]:
df = spark_session.createDataFrame(
    data=[("Alice", 25), ("Bob", 30), ("Cathy", 35)]
)
df.show()

                                                                                

+-----+---+
|   _1| _2|
+-----+---+
|Alice| 25|
|  Bob| 30|
|Cathy| 35|
+-----+---+



## Read CSV

Use the `read.csv` method of the Spark session to read a CSV file.

---

The following cell reads the `spark_files/scv_example.csv` file that I prepared earlier.

In [4]:
spark = SparkSession.builder.appName("Temp").getOrCreate()
df = spark.read.csv(
    "spark_files/scv_example.csv",
    header=True,
    inferSchema=True,
    multiLine=True,
    escape=','
)
display(df)

DataFrame[Name: string,  Age: double,  Salary: double]

### Schema

Use the `schema` argument to set the schema explicitly instead of inferring it. The schema can be given as a DDL-formatted string that pairs each column name with its expected data type.

---

The following cell assigns the `int` data type to the `Age` column instead of the `double` type that was inferred in the previous example.

In [10]:
schema = """
Name string,
Age int,
Salary double
"""

spark.read.csv(
    "spark_files/scv_example.csv",
    header=True,  # skip the header line so it is not read as a data row
    schema=schema
)

DataFrame[Name: string, Age: int, Salary: double]