<a href="https://colab.research.google.com/github/cagBRT/Statistics-with-Python/blob/main/Statistics_with_PySpark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using PySpark to calculate Summary Statistics

PySpark Python APIs<br>
PySpark is distributed as the Python library pyspark, which exposes Spark functionality through Python APIs. It lets developers work with Spark's core components, such as SparkContext, DataFrame, and RDD, from Python code.<br>
The library provides a high-level interface that hides much of the complexity of distributed computing, so Python developers can focus on data analysis and manipulation.<br>
PySpark supports both the interactive PySpark shell (pyspark) and Python script execution using the spark-submit command.<br>


In [None]:
!pip install pyspark

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4],
        ['A', 'East', 8, 9],
        ['A', 'East', 10, 3],
        ['B', 'West', 6, 12],
        ['B', 'West', 6, 4],
        ['C', 'East', 5, 2]]

#define column names
columns = ['team', 'conference', 'points', 'assists']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

**Calculate summary statistics for all columns**

The summary() method reports the following statistics for each column:

count: The number of values in the column<br>
mean: The mean value<br>
stddev: The standard deviation of the values<br>
min: The minimum value<br>
25%: The 25th percentile<br>
50%: The 50th percentile (this is also the median)<br>
75%: The 75th percentile<br>
max: The maximum value<br>
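
To make these definitions concrete, here is a minimal plain-Python sketch that recomputes a few of the statistics for the points column using only the standard library. It does not call Spark. The variable names are illustrative, and it assumes the reported stddev is the sample standard deviation (n - 1 in the denominator). Percentiles are left out to keep the sketch short.

```python
import statistics

# Values of the "points" column from the example DataFrame above
points = [11, 8, 10, 6, 6, 5]

stats = {
    "count": len(points),
    "mean": statistics.mean(points),
    # Assumption: sample standard deviation (n - 1 in the denominator)
    "stddev": statistics.stdev(points),
    "min": min(points),
    "max": max(points),
}

for name, value in stats.items():
    print(f"{name}: {value}")
```

Comparing these numbers with the points column of the summary table is a quick sanity check on small data.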

In [None]:
df.summary().show()

**Summary statistics for numeric columns.**

In [None]:
#identify numeric columns in DataFrame
numeric_cols = [c for c, t in df.dtypes if not t.startswith('string')]

#calculate summary statistics for only the numeric columns
df.select(*numeric_cols).summary().show()

In [None]:
from pyspark.sql import functions as F

#aggregate one column; the F prefix avoids shadowing Python's built-in min and max
df.select(F.mean('points'), F.min('points'), F.max('points')).show()

**Measuring Covariance**<br>
Covariance measures how two variables change together.<br>

A positive value means that as one variable increases, the other tends to increase as well.<br>

A negative value means that as one variable increases, the other tends to decrease.<br>
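
The definition can be checked by hand. The sketch below computes the sample covariance of points and assists in plain Python, without Spark. The helper name sample_covariance is made up for this illustration, and the n - 1 denominator assumes the sample (not population) covariance is wanted.

```python
# Columns from the example DataFrame above
points = [11, 8, 10, 6, 6, 5]
assists = [4, 9, 3, 12, 4, 2]

def sample_covariance(xs, ys):
    """Sum of products of deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

cov = sample_covariance(points, assists)
print(round(cov, 4))  # -1.7333
```

The negative value indicates that, in this small sample, rows with more points tend to have fewer assists.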

In [None]:
#sample covariance between points and assists
df.stat.cov('points', 'assists')

**Cross Tabulation**<br>

Cross tabulation produces a table of the frequency distribution of a set of variables.<br>

It is a common way to examine whether two categorical variables appear to be related or independent.

In [None]:
# Create a DataFrame with two columns (name, item)
names = ["Alice", "Bob", "Mike", "Toby"]
items = ["milk", "bread", "butter", "apples", "oranges", "walnuts"]

df = spark.createDataFrame([(names[i % 4], items[i % 6]) for i in range(100)], ["name", "item"])

# Take a look at the first 10 rows.
df.show(10)

The columns passed to crosstab should have low cardinality, because the result has one row per distinct “name” and one column per distinct “item”.<br>

Just imagine if “item” contained 1 billion distinct entries: how would you fit that table on your screen?
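
To see why cardinality drives the table size, here is a plain-Python sketch that builds the same pair counts with collections.Counter and reports the table dimensions. It only illustrates the idea and is not how Spark computes the result. The variable names are illustrative.

```python
from collections import Counter

names = ["Alice", "Bob", "Mike", "Toby"]
items = ["milk", "bread", "butter", "apples", "oranges", "walnuts"]
rows = [(names[i % 4], items[i % 6]) for i in range(100)]

# Count each (name, item) pair; pairs that never occur read as 0
pair_counts = Counter(rows)

# Table shape: one row per distinct name, one column per distinct item
n_rows = len({name for name, _ in rows})
n_cols = len({item for _, item in rows})

print(f"table size: {n_rows} x {n_cols}")
for name in names:
    print(name, [pair_counts[(name, item)] for item in items])
```

With 4 names and 6 items the table has 24 cells. The cell count is the product of the two cardinalities, which is why high-cardinality columns quickly become unmanageable.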

In [None]:
df.stat.crosstab("name", "item").show()

**Frequent Items**<br>

Knowing which items are frequent in each column can be very useful for understanding a dataset.

PySpark implements a one-pass algorithm proposed by Karp et al. It is fast and approximate, and it always returns every item that appears in at least a user-specified minimum proportion of rows.<br>

**Note**: the result might contain false positives, i.e. items that are not frequent.
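
The idea behind this family of one-pass algorithms can be sketched in plain Python. The function below is a simplified counter-based version (often called Misra-Gries) and is not Spark's actual implementation. The name frequent_item_candidates and the choice of int(1 / support) counters are assumptions made for this illustration.

```python
def frequent_item_candidates(values, support):
    """One pass over the data with a bounded number of counters."""
    max_counters = int(1 / support)
    counters = {}
    for value in values:
        if value in counters:
            counters[value] += 1
        elif len(counters) < max_counters:
            counters[value] = 1
        else:
            # No free counter: decrement all and drop those that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return sorted(counters)

# Same values as column "a" of the DataFrame created in the next cell
column_a = [1 if i % 2 == 0 else i for i in range(100)]
candidates = frequent_item_candidates(column_a, 0.4)
print(candidates)
```

Any value that occurs in at least the requested fraction of rows is guaranteed to survive in the counters. A rare value that happens to arrive near the end of the pass can survive too, which is where false positives come from.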

In [None]:
df = spark.createDataFrame([(1, 2, 3) if i % 2 == 0 else (i, 2 * i, i % 4) for i in range(100)], ["a", "b", "c"])

df.show(10)

Given the above DataFrame, the following code finds the items that appear in at least 40% of the rows of each column:

In [None]:
freq = df.stat.freqItems(["a", "b", "c"], 0.4) #40%

freq.collect()[0]
#Column a: the value 1 appears in 51 of the 100 rows; any other value returned is a false positive

**Frequent items for column combinations**<br>

Only the combination “a=1 and b=2” is truly frequent in this dataset: it appears in 51 of the 100 rows. Any other combination in the result, such as “a=99 and b=198” or “a=11 and b=22”, occurs only once and is a false positive.
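
One way to separate real frequent combinations from false positives is an exact count. The plain-Python sketch below does this for the (a, b) pairs of the example data. It is only practical when the number of distinct pairs fits in memory. The variable names are illustrative.

```python
from collections import Counter

# Same rows as the example DataFrame, restricted to columns (a, b)
pairs = [(1, 2) if i % 2 == 0 else (i, 2 * i) for i in range(100)]

threshold = 0.4 * len(pairs)
exact_counts = Counter(pairs)

# Keep only the pairs that really reach the support threshold
truly_frequent = [pair for pair, n in exact_counts.items() if n >= threshold]

print(truly_frequent)           # [(1, 2)]
print(exact_counts[(1, 2)])     # 51
print(exact_counts[(99, 198)])  # 1
```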


In [None]:
from pyspark.sql.functions import struct

freq = df.withColumn('ab', struct('a', 'b')).stat.freqItems(['ab'], 0.4)

freq.collect()[0]