# Important classes of Spark SQL and DataFrames:

    - :class:`pyspark.sql.SQLContext`
      Main entry point for :class:`DataFrame` and SQL functionality.
    - :class:`pyspark.sql.DataFrame`
      A distributed collection of data grouped into named columns.
    - :class:`pyspark.sql.Column`
      A column expression in a :class:`DataFrame`.
    - :class:`pyspark.sql.Row`
      A row of data in a :class:`DataFrame`.
    - :class:`pyspark.sql.HiveContext`
      Main entry point for accessing data stored in Apache Hive.
    - :class:`pyspark.sql.GroupedData`
      Aggregation methods, returned by :func:`DataFrame.groupBy`.
    - :class:`pyspark.sql.DataFrameNaFunctions`
      Methods for handling missing data (null values).
    - :class:`pyspark.sql.DataFrameStatFunctions`
      Methods for statistics functionality.
    - :class:`pyspark.sql.functions`
      List of built-in functions available for :class:`DataFrame`.
    - :class:`pyspark.sql.types`
      List of data types available.
    - :class:`pyspark.sql.Window`
      For working with window functions.

In [1]:
# Run this cell first: it sets the Java environment variables, which are not
# picked up here even though they are defined in .bash_profile.
# TODO: Fix this

%run '../../spark_variables.ipynb'

In [2]:
from pyspark import SparkContext
from pyspark.sql import *

# sc.stop()  # uncomment to stop a context that is already running
sc = SparkContext(master="local[3]")  # local mode with 3 worker threads
sqlContext = SQLContext(sc)

## DataFrameStatFunctions

Methods for statistics functionality, [documented here](http://takwatanabe.me/pyspark/generated/generated/pyspark.sql.DataFrameStatFunctions.html).

* **approxQuantile(col, probabilities, relativeError)**: Calculates the approximate quantiles of a numerical column of a DataFrame.
* **corr(col1, col2[, method])**: Calculates the correlation of two columns of a DataFrame as a double value.
* **cov(col1, col2)**: Calculates the sample covariance for the given columns, specified by their names, as a double value.
* **crosstab(col1, col2)**: Computes a pair-wise frequency table of the given columns.
* **freqItems(cols[, support])**: Finds frequent items for columns, possibly with false positives.
* **sampleBy(col, fractions[, seed])**: Returns a stratified sample without replacement based on the fraction given on each stratum.

In [3]:
DataFrameStatFunctions.approxQuantile?

Signature: DataFrameStatFunctions.approxQuantile(self, col, probabilities, relativeError)
Docstring:
Calculates the approximate quantiles of numerical columns of a
DataFrame.

The result of this algorithm has the following deterministic bound:
If the DataFrame has N elements and if we request the quantile at
probability `p` up to error `err`, then the algorithm will return
a sample `x` from the DataFrame so that the *exact* rank of `x` is
close to (p * N). More precisely,

  floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).

This method implements a variation of the Greenwald-Khanna
algorithm (with some speed optimizations). The algorithm was first
present in [[http://dx.doi.org/10.1
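
The rank bound quoted in the docstring can be checked by hand. The following plain-Python sketch is not Spark's implementation: the helper names `rank_bounds` and `satisfies_bound` are hypothetical, and it assumes that `rank(x)` means the number of elements less than or equal to `x`.

```python
import math


def rank_bounds(p, err, n):
    """Return the (lowest, highest) rank allowed by the docstring's bound."""
    low = math.floor((p - err) * n)
    high = math.ceil((p + err) * n)
    return low, high


def satisfies_bound(values, x, p, err):
    """Check whether x is an acceptable approximate p-quantile of values."""
    # Assumption: rank(x) is the count of elements <= x.
    rank = sum(1 for v in values if v <= x)
    low, high = rank_bounds(p, err, len(values))
    return low <= rank <= high


values = list(range(1, 1001))  # N = 1000 elements: 1, 2, ..., 1000
print(rank_bounds(0.5, 0.125, len(values)))
print(satisfies_bound(values, 500, 0.5, 0.125))  # the exact median
print(satisfies_bound(values, 900, 0.5, 0.125))  # far from the median
```

A smaller `relativeError` narrows the window of acceptable ranks, at the cost of more work for the algorithm.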

In [4]:
DataFrameStatFunctions.corr?

Signature: DataFrameStatFunctions.corr(self, col1, col2, method=None)
Docstring:
Calculates the correlation of two columns of a DataFrame as a double value.
Currently only supports the Pearson Correlation Coefficient.
:func:`DataFrame.corr` and :func:`DataFrameStatFunctions.corr` are aliases of each other.

:param col1: The name of the first column
:param col2: The name of the second column
:param method: The correlation method. Currently only supports "pearson"

.. versionadded:: 1.4
File:      /usr/local/Cellar/apache-spark/2.4.4/libexec/python/pyspark/sql/dataframe.py
Type:      function
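
To make the returned value concrete, here is a minimal plain-Python sketch of the Pearson correlation coefficient, the only method the docstring says is supported. It only illustrates the formula on two small lists; the function name `pearson_corr` is hypothetical and this is not how Spark computes it on distributed data.

```python
import math


def pearson_corr(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each column
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    dev_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / math.sqrt(dev_x * dev_y)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # ys is an exact linear function of xs
r = pearson_corr(xs, ys)
print(r)
```

Because `ys` is a positive multiple of `xs`, the coefficient here is 1 up to floating-point rounding.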


In [5]:
DataFrameStatFunctions.cov?

Signature: DataFrameStatFunctions.cov(self, col1, col2)
Docstring:
Calculate the sample covariance for the given columns, specified by their names, as a
double value. :func:`DataFrame.cov` and :func:`DataFrameStatFunctions.cov` are aliases.

:param col1: The name of the first column
:param col2: The name of the second column

.. versionadded:: 1.4
File:      /usr/local/Cellar/apache-spark/2.4.4/libexec/python/pyspark/sql/dataframe.py
Type:      function
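
The word "sample" in the docstring matters: the sum of products of deviations is divided by `n - 1` rather than `n`. A small plain-Python sketch of that formula follows; the name `sample_cov` is hypothetical and this is not Spark's code.

```python
def sample_cov(xs, ys):
    """Sample covariance of two equal-length lists (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return co_dev / (n - 1)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
print(sample_cov(xs, ys))  # sum of products is 10, divided by n - 1 = 3
```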


In [6]:
DataFrameStatFunctions.crosstab?

Signature: DataFrameStatFunctions.crosstab(self, col1, col2)
Docstring:
Computes a pair-wise frequency table of the given columns. Also known as a contingency
table. The number of distinct values for each column should be less than 1e4. At most 1e6
non-zero pair frequencies will be returned.
The first column of each row will be the distinct values of `col1` and the column names
will be the distinct values of `col2`. The name of the first column will be `$col1_$col2`.
Pairs that have no occurrences will have zero as their counts.
:func:`DataFrame.crosstab` and :func:`DataFrameStatFunctions.crosstab` are aliases.

:param col1: The name of the first column. Distinct items will make the first item of
    each row.
:param col2: The name of the second column. Distinct items will make the column names
    of the DataFrame.

.. versionadded:: 1.4
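
The layout described in the docstring (first column holds the distinct values of `col1`, remaining column names are the distinct values of `col2`, missing pairs count as zero) can be sketched in plain Python. The function `crosstab_sketch` and the sample rows are hypothetical, and sorting the keys is a choice made here for readability, not something the docstring promises.

```python
from collections import Counter


def crosstab_sketch(rows, col1, col2):
    """Build a contingency table (header, table rows) from a list of dicts."""
    counts = Counter((row[col1], row[col2]) for row in rows)
    row_keys = sorted({k1 for k1, _ in counts})
    col_keys = sorted({k2 for _, k2 in counts})
    # The first column is named "<col1>_<col2>", as in the docstring
    header = [f"{col1}_{col2}"] + [str(k) for k in col_keys]
    # Pairs that never occur get a count of zero
    table = [[k1] + [counts.get((k1, k2), 0) for k2 in col_keys]
             for k1 in row_keys]
    return header, table


rows = [
    {"key": 1, "val": "a"},
    {"key": 1, "val": "b"},
    {"key": 2, "val": "a"},
    {"key": 1, "val": "a"},
]
header, table = crosstab_sketch(rows, "key", "val")
print(header)
for line in table:
    print(line)
```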
[0;31

In [7]:
DataFrameStatFunctions.freqItems?

Signature: DataFrameStatFunctions.freqItems(self, cols, support=None)
Docstring:
Finding frequent items for columns, possibly with false positives. Using the
frequent element count algorithm described in
"http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou".
:func:`DataFrame.freqItems` and :func:`DataFrameStatFunctions.freqItems` are aliases.

.. note:: This function is meant for exploratory data analysis, as we make no
    guarantee about the backward compatibility of the schema of the resulting DataFrame.

:param cols: Names of the columns to calculate frequent items for as a list or tuple of
    strings.
:param support: The frequency with which to consider an item 'frequent'. Default is 1%.
    The support must be greater than 1e-4.

.. versionadded:: 1.4
File:      /usr/loca
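
The phrase "possibly with false positives" is easier to see with a small example. Below is a plain-Python sketch of a counter-based frequent-items pass in the spirit of the algorithm the docstring cites. It is an illustration under stated assumptions, not Spark's implementation: the name `frequent_items_sketch` and the choice of `ceil(1 / support)` counters are made up for this example.

```python
import math


def frequent_items_sketch(items, support):
    """Return candidates that include every item with frequency > support."""
    max_counters = math.ceil(1 / support)  # assumption: enough counters
    counters = {}
    for item in items:
        if item in counters:
            counters[item] += 1
        elif len(counters) < max_counters:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)


# "a" appears in 6 of 12 positions (50%); "e" appears only once
stream = ["a", "b", "a", "c", "a", "d", "a", "b", "a", "e", "a", "b"]
candidates = frequent_items_sketch(stream, support=0.4)
print(candidates)
```

Every item whose frequency exceeds the support is guaranteed to survive, but rare items that arrive late can survive too, which is why the result is only a candidate set. A second exact counting pass over the candidates would remove the false positives.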

In [8]:
DataFrameStatFunctions.sampleBy?

Signature: DataFrameStatFunctions.sampleBy(self, col, fractions, seed=None)
Docstring:
Returns a stratified sample without replacement based on the
fraction given on each stratum.

:param col: column that defines strata
:param fractions:
    sampling fraction for each stratum. If a stratum is not
    specified, we treat its fraction as zero.
:param seed: random seed
:return: a new DataFrame that represents the stratified sample

>>> from pyspark.sql.functions import col
>>> dataset = sqlContext.range(0, 100).select((col("id") % 3).alias("key"))
>>> sampled = dataset.sampleBy("key", fractions={0: 0.1, 1: 0.2}, seed=0)
>>> sampled.groupBy("key").count().orderBy("key").show()
+---+-----+
|key|count|
+---+-----+
|  0|    5|
|  1|    9|
+---+-----+

.. versionadded:: 1.5
File:      /usr/local/Cellar
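
The docstring's rules (one sampling fraction per stratum, unspecified strata treated as zero, no replacement) can be sketched in plain Python. This assumes each row is kept independently with its stratum's probability, which is why the counts in the docstring example only approximate the requested fractions. The name `sample_by_sketch` is hypothetical and this is not Spark's code.

```python
import random


def sample_by_sketch(rows, key, fractions, seed=None):
    """Keep each row with the probability assigned to its stratum."""
    rng = random.Random(seed)
    sampled = []
    for row in rows:
        # Strata missing from `fractions` get fraction 0 and are dropped
        fraction = fractions.get(row[key], 0.0)
        if rng.random() < fraction:
            sampled.append(row)
    return sampled


rows = [{"key": i % 3} for i in range(100)]
sampled = sample_by_sketch(rows, "key", fractions={0: 0.1, 1: 0.2}, seed=0)
kept_keys = {row["key"] for row in sampled}
print(len(sampled), kept_keys)
```

Stratum `2` is never sampled because it has no entry in `fractions`, and passing the same `seed` reproduces the same sample.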