## Learning PySpark
### Chapter 4: DataFrame Functions
This notebook contains sample code from Chapter 4 of [Learning PySpark](), focusing on PySpark DataFrame functions.

#### Generating data to be used for the various functions

In [3]:
# Generate our own JSON data 
#   This way we don't have to access the file system yet.
stringJSONRDD = sc.parallelize((""" 
  { "id": 123,
    "name": "Katie",
    "age": 19,
    "eyeColor": "brown",
    "goldDate": "2005-01-22",
    "level": -1
  }""",
   """{
    "id": 234,
    "name": "Michael",
    "age": 22,
    "eyeColor": "green",
    "goldDate": "2011-11-12",
    "level": -2
  }""", 
  """{
    "id": 345,
    "name": "Simone",
    "age": 23,
    "eyeColor": "blue",
    "goldDate": "2008-06-07",
    "level": -3
  }""")
)

In [4]:
# Create DataFrame
df = spark.read.json(stringJSONRDD)
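
To see what each record contains before Spark is involved, the same JSON strings can be parsed with Python's `json` module. This is only a plain-Python sketch of the input data, not a description of how `spark.read.json` works internally; the names `json_strings`, `records`, and `column_names` are hypothetical.

```python
import json

# The same three documents the notebook passes to sc.parallelize.
json_strings = (
    '{"id": 123, "name": "Katie", "age": 19, "eyeColor": "brown", '
    '"goldDate": "2005-01-22", "level": -1}',
    '{"id": 234, "name": "Michael", "age": 22, "eyeColor": "green", '
    '"goldDate": "2011-11-12", "level": -2}',
    '{"id": 345, "name": "Simone", "age": 23, "eyeColor": "blue", '
    '"goldDate": "2008-06-07", "level": -3}',
)

# Parse each string into a dict, one dict per future DataFrame row.
records = [json.loads(text) for text in json_strings]

# Collect every key seen across the records, sorted by name.
column_names = sorted({key for record in records for key in record})

print(column_names)
for record in records:
    # JSON has no date type, so goldDate is a plain string here.
    print(record["name"], type(record["goldDate"]).__name__)
```

Comparing this listing with the `printSchema()` output later in the notebook is a quick way to confirm which types Spark inferred for each field.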

In [5]:
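# Include pyspark.sql.functions
#   Note: the wildcard import rebinds Python built-ins such as abs, sum, min, max and round
#   to their Spark column equivalents for the rest of the notebook.
from pyspark.sql.functions import *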

### Display your data (and schema)

In [7]:
display(df)

In [8]:
df.printSchema()

### Math Functions

In [10]:
# abs($col1)
#   Absolute value
df.select(
  df.level,
  abs(df.level).alias('abs_level')
).show()

In [11]:
# acos($col1)
#   Computes the inverse cosine (arccosine), in radians.
#   age is divided by 100 so the input falls inside acos's domain of [-1, 1].
df.select(
  (1.*df.age/100).alias('age~'),
  acos(1.*df.age/100).alias('acos_age~')
).show()
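
As a cross-check on the `acos` cell, the same scaling can be reproduced with Python's `math` module. This is a minimal sketch on the three `age` values from the sample data, not Spark's implementation; `scaled` and `acos_values` are hypothetical names.

```python
import math

ages = [19, 22, 23]

# Divide by 100 so every input lies inside acos's domain of [-1, 1].
scaled = [age / 100 for age in ages]

# math.acos returns the angle in radians, in the range [0, pi].
acos_values = [math.acos(x) for x in scaled]

for x, angle in zip(scaled, acos_values):
    print(f"{x:.2f} -> {angle:.4f} rad")
```

Note that `math.acos` raises `ValueError` for an input outside [-1, 1], so calling it on the raw ages would fail in plain Python.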

In [12]:
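# approxQuantile($col1, probabilities, relativeError)
#   Using approxQuantile to calculate the median (probability 0.5).
#   A relativeError of 0 requests the exact quantile, which can be expensive on large data.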
df.approxQuantile("age", [0.5], 0)
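
To make the idea of a quantile concrete, here is a small exact computation on the three `age` values. It is a sketch under a stated assumption: the rule "take the value at rank ceil(p * n)" is chosen for illustration, whereas Spark's `approxQuantile` is documented as using a variation of the Greenwald-Khanna algorithm. The helper `exact_quantile` is hypothetical.

```python
import math

def exact_quantile(values, probability):
    """Return the value at rank ceil(probability * n) in sorted order."""
    ordered = sorted(values)
    rank = math.ceil(probability * len(ordered))
    # Assumption: a probability of 0 maps to the smallest element.
    if rank < 1:
        rank = 1
    return float(ordered[rank - 1])

ages = [19, 22, 23]
median_age = exact_quantile(ages, 0.5)
print(median_age)  # 22.0
```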

In [13]:
# corr($col1, $col2, $method)
#   Use corr to calculate the correlation of two columns of a DataFrame as a double value. 
#   Currently only supports the Pearson Correlation Coefficient.
df.corr("id", "age")
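
For reference, the Pearson coefficient for the `id` and `age` columns can be worked out by hand on the sample data. The sketch below follows the textbook formula and is not Spark's code; `pearson` is a hypothetical helper. It uses `math.fsum` rather than the built-in `sum`, because the wildcard import earlier in the notebook rebinds `sum` to a Spark column function.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = math.fsum(xs) / n
    mean_y = math.fsum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = math.fsum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = math.fsum((x - mean_x) ** 2 for x in xs)
    syy = math.fsum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

ids = [123, 234, 345]
ages = [19, 22, 23]

r = pearson(ids, ages)
print(f"{r:.4f}")  # 0.9608
```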

In [14]:
# count()
#   Get the count of rows within your DataFrame
df.count()

In [15]:
# cov($col1, $col2)
#   Calculate the sample covariance for the given columns, specified by their names, as a double value 
df.cov("id", "age")
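
The sample covariance divides the sum of products of deviations by n - 1 rather than n. A minimal plain-Python sketch on the `id` and `age` values follows; `sample_covariance` is a hypothetical helper, and `math.fsum` stands in for the built-in `sum`, which the wildcard import earlier in the notebook rebinds.

```python
import math

def sample_covariance(xs, ys):
    """Sample covariance: deviations' products summed, divided by n - 1."""
    n = len(xs)
    mean_x = math.fsum(xs) / n
    mean_y = math.fsum(ys) / n
    products = ((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return math.fsum(products) / (n - 1)

ids = [123, 234, 345]
ages = [19, 22, 23]

cov_id_age = sample_covariance(ids, ages)
print(f"{cov_id_age:.1f}")  # 222.0
```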

In [16]:
# crosstab($col1, $col2)
#   Computes a contingency (pair-wise frequency) table of the given columns.
#   The number of distinct values for each column should be less than 1e4. At most 1e6 non-zero pair frequencies will be returned. 
# df.crosstab("name", "eyeColor").show()
display(df.crosstab("name", "eyeColor"))
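
A contingency table simply counts how often each pair of values occurs together. The sketch below builds one for `name` and `eyeColor` with `collections.Counter`; it illustrates the concept only, and the layout of `table` (a dict of rows) is an assumption rather than the shape Spark returns.

```python
from collections import Counter

# (name, eyeColor) pairs from the sample data.
rows = [("Katie", "brown"), ("Michael", "green"), ("Simone", "blue")]

# Count every (name, eyeColor) pair that occurs.
pair_counts = Counter(rows)

names = sorted({name for name, _ in rows})
colors = sorted({color for _, color in rows})

# One row per name, one column per eye color; missing pairs count as 0.
table = {name: [pair_counts[(name, color)] for color in colors] for name in names}

print("name", *colors)
for name in names:
    print(name, *table[name])
```

Because every name in the sample is paired with a different eye color, each row contains a single 1 and the rest are 0.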

In [17]:
# cube(*cols)
#   Create a multi-dimensional cube for the DataFrame using the specified columns, so we can run aggregation on them.
# df.cube("name", "eyeColor").count().orderBy("name", "eyeColor").show()
display(df.cube("name", "eyeColor").count().orderBy("name", "eyeColor"))
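
A cube aggregates over every subset of the grouping columns, with the omitted columns shown as null. The following plain-Python sketch reproduces that idea for `count()` on the sample data, using `None` in place of null; it is an illustration, not Spark's implementation, and `cube_counts` is a hypothetical name.

```python
from collections import Counter
from itertools import combinations

rows = [
    {"name": "Katie", "eyeColor": "brown"},
    {"name": "Michael", "eyeColor": "green"},
    {"name": "Simone", "eyeColor": "blue"},
]
cube_columns = ("name", "eyeColor")

cube_counts = Counter()
# Visit every subset of the grouping columns: (), (name), (eyeColor), (name, eyeColor).
for size in range(len(cube_columns) + 1):
    for kept in combinations(cube_columns, size):
        for row in rows:
            # Columns outside the current subset are replaced by None.
            key = tuple(row[c] if c in kept else None for c in cube_columns)
            cube_counts[key] += 1

for key, count in cube_counts.items():
    print(key, count)
```

With three rows that share no values, this yields ten groups: one grand total, three per-name totals, three per-eye-color totals, and three fully specified pairs.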

In [18]:
# describe(*cols)
#   Computes basic statistics (count, mean, stddev, min, max) for the given columns.
#   If no columns are specified, statistics are computed for all numeric and string columns.
df.describe("age", "level").show()
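
The five summary statistics can be reproduced with Python's `statistics` module as a sanity check. This is a sketch on the sample values only; the `describe` helper here is hypothetical and merely mirrors the statistic names, and `stddev` is taken to be the sample standard deviation (n - 1 denominator).

```python
import statistics

def describe(values):
    """Return count, mean, sample stddev, min, and max for a list of numbers."""
    ordered = sorted(values)
    return {
        "count": len(ordered),
        "mean": statistics.mean(ordered),
        "stddev": statistics.stdev(ordered),  # sample standard deviation
        "min": ordered[0],
        "max": ordered[-1],
    }

age_stats = describe([19, 22, 23])
level_stats = describe([-1, -2, -3])

for label, stats in (("age", age_stats), ("level", level_stats)):
    print(label, stats)
```

Sorting once and reading the first and last elements avoids the built-in `min` and `max`, which the wildcard import earlier in the notebook rebinds to Spark column functions.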

In [19]:
# dtypes
#   Returns all column names and their data types as a list of (name, type) tuples.
df.dtypes