- ["Quickstart: DataFrame"](https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html) -- a one-page guide from the official docs

- [`pyspark.sql.functions`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html) -- discover useful column-wise functions here

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

Question 1

In [2]:
df = spark.createDataFrame([
    (x,) for x in range(1, 5)
], schema='x int')

In [11]:
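from pyspark.sql import functions as F  # alias avoids shadowing Python's built-in pow()

# F.pow() returns a double, hence 1.0, 4.0, ... in the output
df.withColumn("x squared", F.pow(df.x, 2)).show()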

+---+---------+
|  x|x squared|
+---+---------+
|  1|      1.0|
|  2|      4.0|
|  3|      9.0|
|  4|     16.0|
+---+---------+



Question 2

In [12]:
from pyspark.sql import functions as F  # alias avoids shadowing Python's built-in max()

df.select(F.max(df.x)).show()

+------+
|max(x)|
+------+
|     4|
+------+



Question 3

In [14]:
from pyspark.sql.functions import avg

df.select(avg(df.x)).show()

+------+
|avg(x)|
+------+
|   2.5|
+------+



Question 4

In [23]:
file_path = 'data/foo.csv'

# Spark writes a directory of part files at this path;
# mode='overwrite' replaces it if it already exists
df.write.csv(file_path, header=True, mode='overwrite')

spark.read.csv(file_path, header=True).show()

+---+
|  x|
+---+
|  1|
|  2|
|  3|
|  4|
+---+



Question 5

[Relevant StackOverflow post](https://stackoverflow.com/questions/48927271/count-number-of-words-in-a-spark-dataframe)

`read.text()` will read the text file into a DataFrame. Each line will be stored in a separate row (as a string).

[`functions.split()`](https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.split) will split each line into an array column of words (an array of strings, not a `Row`).
[`functions.size()`](https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.size) will return the length of that array for each line.

Finally, sum the per-line word counts using [`functions.sum()`](https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.sum).

In [47]:
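from pyspark.sql import functions as F  # alias avoids shadowing Python's built-in sum()

file_path = 'data/shakespeare.txt'

df = spark.read.text(file_path)  # one row per line, in a string column named 'value'

# per-line word count: the number of space-separated tokens in each line
df = df.withColumn('words', F.size(F.split(df.value, ' ')))
df.select(F.sum(df.words)).show()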

+----------+
|sum(words)|
+----------+
|       256|
+----------+
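
One caveat about the split / size / sum approach: splitting on a single space also counts empty strings as tokens, so blank lines and repeated or trailing spaces can inflate the total. The sketch below illustrates this in plain Python, without Spark. It assumes that `str.split(" ")` is a fair stand-in for `split(df.value, ' ')` in how empty tokens are kept. The sample lines are hypothetical and are not taken from `data/shakespeare.txt`.

```python
import re

# Hypothetical sample lines (an assumption for illustration only).
lines = ["To be,  or not to be", "", "  that is the question "]

# Mirrors split on ' ' + size + sum: empty strings are counted as tokens.
naive_total = sum(len(line.split(" ")) for line in lines)

# Splits on runs of whitespace and drops empty tokens before counting.
robust_total = sum(
    len([w for w in re.split(r"\s+", line) if w]) for line in lines
)

print(naive_total, robust_total)  # 15 10
```

If the stricter definition of a word is wanted in Spark, one option would be to split on a whitespace pattern and discard empty tokens before taking the size. That variant is a suggestion rather than part of the original solution, and it may produce a different total from the one shown above.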

