# Welcome to DataFrames

The Spark DataFrame API sits on top of the RDD API to provide a SQL-like interface.

- [Row](https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.Row.html) - A row in a `DataFrame`
- [toDF](https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.toDF.html) - Returns a new `DataFrame` with the specified column names
- [show(n = 20)](https://spark.apache.org/docs/3.2.0/api/python/reference/api/pyspark.sql.DataFrame.show.html) - Prints the first n rows to the console

In [None]:
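from pyspark.context import SparkContext
from pyspark.sql import Row, SparkSession

spark_context = SparkContext.getOrCreate()
# rdd.toDF() requires an active SparkSession, so create (or reuse) one here.
spark = SparkSession.builder.getOrCreate()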
rdd = spark_context.parallelize(
    [
        Row("Grapes",   "500g", 1.23),
        Row("Cheddar",  "2Mg",  5600.0),
        Row("Crackers", "16",   10.20),
    ]
)
shopping_dataframe = rdd.toDF(['name', 'quantity', 'price'])

shopping_dataframe.show()

- [filter](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.filter.html#pyspark.sql.DataFrame.filter) - Filters rows using the given condition.
- [col](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.col.html) - Returns a `Column` based on the given column name.

In [None]:
from pyspark.sql.functions import col

shopping_dataframe.filter(col('price') < 10).show()

## Exercises

1. Using the existing `shopping_dataframe`, print to the console a new DataFrame that contains an additional column called "price_in_pennies", calculated by multiplying `price` by 100.
    <details>
      <summary>Hint</summary>
      Search for the "withColumn" DataFrame method in the API docs; see Resources below.
    </details>
1. Using the existing `shopping_dataframe`, print to the console a new DataFrame without the `name` column.
    <details>
      <summary>Hint</summary>
      Search for the "select" DataFrame method in the API docs; see Resources below.
    </details>
1. Using the existing `shopping_dataframe`, print to the console a DataFrame with a single column, "name", ordered so that the most expensive item is listed first.
    <details>
      <summary>Hint</summary>
      Search for the "sort" DataFrame method in the API docs; see Resources below.
    </details>
1. Save `shopping_dataframe` to S3 in JSON format using the path "s3://shopping-a276085/your_name/". Then navigate to the S3 interface and download a copy of the saved shopping data.
    <details>
      <summary>Hint</summary>
      Search for the "write" DataFrame property and the "json" method on the object it returns; see Resources below.
    </details>

## Resources

- [DataFrame API docs](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html)
- [Column API docs](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html)