## Calculating Cramér's V from a Spark DataFrame

### What is Cramér's V?
Cramér's V is a measure of association between two nominal variables, taking a value between 0 and 1 inclusive. A value of 0 indicates no association and 1 indicates a perfect association between the two variables. It is based on Pearson's chi-squared statistic.

We calculate Cramér's V as follows:

$$ \text{Cramer's V} = \sqrt{\dfrac{\dfrac{\chi^2}{n}}{\min (c-1,r-1)}}, $$ 
where:
- $\chi^2$ is the Chi-squared statistic,
- $n$ is the total number of observations (the sum of all cell counts),
- $r$ is the number of rows in the contingency table,
- $c$ is the number of columns in the contingency table.

In some literature you may see the Phi coefficient used ($\phi$), where $\phi^2 = \chi^2/n$.
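
To make the formula concrete, here is a minimal pure-Python sketch that computes $\chi^2$ and Cramér's V by hand for two small tables. The function name `cramers_v` and both example tables are illustrative assumptions, not part of the original example.

```python
import math

def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Pearson's chi-squared: sum over cells of (observed - expected)^2 / expected
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    r, c = len(table), len(table[0])
    return math.sqrt((chi2 / n) / min(r - 1, c - 1))

perfect = [[10, 0], [0, 10]]        # each row maps to exactly one column
independent = [[10, 10], [10, 10]]  # rows and columns are unrelated

print(cramers_v(perfect))      # 1.0
print(cramers_v(independent))  # 0.0
```

The two tables sit at the extremes of the scale: the first gives $\chi^2 = 20$ with $n = 20$, so $V = 1$, while the second has observed counts equal to the expected counts, so $V = 0$.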

### Cramér's V in Spark
Although there is no built-in method for calculating this statistic in base Python, it is reasonably straightforward using the `numpy` and `scipy.stats` packages. An example of this can be found [online here](https://www.statology.org/cramers-v-in-python/).
A similar example for R is linked [here](https://www.statology.org/cramers-v-in-r/).

Because PySpark and sparklyr differ from base Python and R, we need to consider how to calculate Cramér's V when using Spark DataFrames.
First we will import the packages we need and start a Spark session.

In [1]:
import numpy as np
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import scipy.stats as stats

spark = (SparkSession.builder.master("local[2]")
         .appName("cramer-v")
         .getOrCreate())

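For this example we will create some dummy data. We use `spark.range(100)` to generate 100 `id` values and union the DataFrame with itself so that each `id` appears twice. The `F.rand()` Spark function generates random numbers uniformly distributed in [0, 1); for repeatable results we set a seed.
Multiplying by -10 and applying `F.ceil()`, which rounds up to the nearest integer, gives integer values from -9 to 0.
After creating the dummy data, we will show the first 5 rows.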


In [4]:
# Setting random number seed
seed_no = 42

# Creating spark dataframe
df = spark.range(100)
df = df.union(df)
df = df.withColumn("value", F.ceil(F.rand(seed_no) * -10))
df.show(5)

+---+-----+
| id|value|
+---+-----+
|  0|   -6|
|  1|   -5|
|  2|   -8|
|  3|   -2|
|  4|   -6|
+---+-----+
only showing top 5 rows
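
As a plain-Python analogue of the `value` column (not Spark code), the sketch below shows why the generated values fall between -9 and 0. It uses `random.random()` in place of `F.rand()`; the seed is arbitrary and will not reproduce Spark's numbers.

```python
import math
import random

random.seed(42)  # arbitrary seed; the values differ from Spark's F.rand(42)

# random.random() is in [0, 1), so multiplying by -10 gives a number in (-10, 0]
values = [math.ceil(random.random() * -10) for _ in range(1000)]

# Rounding up means the smallest possible result is -9 and the largest is 0
print(min(values) >= -9, max(values) <= 0)  # True True
```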



Using the `.crosstab()` method, we can calculate a pair-wise frequency table (also known as a contingency table) of the `id` and `value` columns. We will generate this table and convert it to a pandas DataFrame.

In [8]:
freq_spark = df.crosstab('id','value')
freq_pandas = freq_spark.toPandas()
freq_pandas.head()

  id_value  -1  -2  -3  -4  -5  -6  -7  -8  -9  0
0        7   0   0   0   0   0   0   0   1   0  1
1       51   0   1   0   0   0   1   0   0   0  0
2       15   0   0   0   1   0   0   1   0   0  0
3       54   0   0   1   0   0   1   0   0   0  0
4       11   0   0   0   0   0   1   0   0   1  0
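
To illustrate what a pair-wise frequency table contains, here is a small pure-Python sketch that counts how often each (id, value) pair occurs. The `pairs` list is hypothetical data standing in for the rows of `df`; the sketch shows the idea only and is not how Spark implements `.crosstab()`.

```python
from collections import Counter

# Hypothetical (id, value) pairs standing in for the rows of `df`
pairs = [(0, -6), (1, -5), (0, -2), (1, -5)]

counts = Counter(pairs)
ids = sorted({i for i, _ in pairs})
values = sorted({v for _, v in pairs})

# One row per id, one column per distinct value, zero where a pair never occurs
table = [[counts[(i, v)] for v in values] for i in ids]

print(values)  # [-6, -5, -2]
print(table)   # [[1, 0, 1], [0, 2, 0]]
```

Each row sums to the number of times that id appears, which is why every row of the table above sums to 2.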


Now that the pair-wise frequency table is in our local environment, we can use the `scipy.stats.chi2_contingency()` function. The [documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html) shows that it takes an `array_like` input, so we convert our table into a NumPy array:

In [9]:
freq_numpy = np.array(freq_pandas)

The example Python code has been adapted from the article linked previously. We have restructured it into a function so that it can be run several times. After defining the function, we attempt to run it on our NumPy array `freq_numpy`, wrapped in a `try`-`except` statement so that any `TypeError` is printed rather than stopping the script.

In [15]:
def get_cramer_v(freq_numpy):
    # Chi-squared test statistic, sample size, and the smaller of
    # (rows - 1) and (columns - 1)
    X2 = stats.chi2_contingency(freq_numpy, correction=False)[0]
    n = np.sum(freq_numpy)
    minDim = min(freq_numpy.shape) - 1

    # Calculate Cramér's V
    V = np.sqrt((X2 / n) / minDim)
    return V

try:
    get_cramer_v(freq_numpy)
except TypeError as e:
    print(e)

'<' not supported between instances of 'str' and 'int'
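
A small NumPy sketch can reproduce this kind of error without Spark. The two-row array `mixed` is hypothetical: its first column holds string labels, as `id_value` does, which forces NumPy to store everything as generic Python objects. The assumption here is that the error arises when the values are compared with zero during input validation.

```python
import numpy as np

# Hypothetical two-row table: the first column holds string labels, as `id_value` does
mixed = np.array([["7", 0, 1], ["51", 1, 0]], dtype=object)
print(mixed.dtype)  # object

try:
    mixed < 0  # element-wise comparison, similar to a non-negativity check
except TypeError as e:
    print(e)
```

Checking `freq_numpy.dtype` is a quick way to spot this situation: a table of counts would normally have an integer dtype rather than `object`.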


This `TypeError` says we have strings where we need integers. We should return to our Spark pair-wise frequency table to double check the data types.

In [19]:
freq_spark.printSchema()

root
 |-- id_value: string (nullable = false)
 |-- -1: long (nullable = false)
 |-- -2: long (nullable = false)
 |-- -3: long (nullable = false)
 |-- -4: long (nullable = false)
 |-- -5: long (nullable = false)
 |-- -6: long (nullable = false)
 |-- -7: long (nullable = false)
 |-- -8: long (nullable = false)
 |-- -9: long (nullable = false)
 |-- 0: long (nullable = false)



It looks like the `id_value` column has type string, so the labels from the first column passed to `.crosstab()` come out as strings. Let's fix that using the `.cast()` method.

In [20]:
from pyspark.sql.types import IntegerType
freq_spark = freq_spark.withColumn("id_value", F.col("id_value").cast(IntegerType()))

In [22]:
freq_spark.printSchema()

root
 |-- id_value: integer (nullable = true)
 |-- -1: long (nullable = false)
 |-- -2: long (nullable = false)
 |-- -3: long (nullable = false)
 |-- -4: long (nullable = false)
 |-- -5: long (nullable = false)
 |-- -6: long (nullable = false)
 |-- -7: long (nullable = false)
 |-- -8: long (nullable = false)
 |-- -9: long (nullable = false)
 |-- 0: long (nullable = false)



Now `id_value` is an integer column, so we can carry on with the rest of the processing.

In [23]:
freq_pandas = freq_spark.toPandas()
freq_numpy = np.array(freq_pandas)
print(get_cramer_v(freq_numpy))

0.1935164295375403


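This time we get a result. Note, however, that `freq_numpy` still contains the `id_value` labels as its first column, so `chi2_contingency()` treats them as if they were counts; for a statistic based on the frequencies alone, that column should be excluded before calling `get_cramer_v()`.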
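
Because `np.array(freq_pandas)` keeps every column of the DataFrame, one option is to slice off the label column before computing the statistic. The sketch below uses a small hypothetical array, `labelled`, whose first column plays the role of `id_value`; it is an illustration under that assumption rather than a rerun of the example above.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical crosstab-style array: column 0 holds the id labels, the rest are counts
labelled = np.array([[0, 10, 0],
                     [1, 0, 10]])

counts = labelled[:, 1:]  # keep only the frequency columns

chi2 = stats.chi2_contingency(counts, correction=False)[0]
n = counts.sum()
v = np.sqrt((chi2 / n) / (min(counts.shape) - 1))
print(v)  # close to 1 for this perfectly associated table
```

With the data in this article, `freq_pandas.drop(columns="id_value").to_numpy()` should give an equivalent counts-only array, and it also avoids the need to cast `id_value` at all.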

Also note that all frequencies must be non-negative. If there is a negative value in `freq_numpy`, `chi2_contingency()` raises an error.

To demonstrate, let's plant some negative values in `freq_numpy`, then calculate the statistic again.

In [24]:
freq_numpy[0, 1] = -1
freq_numpy[10, 4] = -5

try:
    get_cramer_v(freq_numpy)
except ValueError as e:
    print(e)

All values in `observed` must be nonnegative.


This `ValueError` tells us we have negative values. How many and where are the negative values?

In [25]:
len(freq_numpy[freq_numpy < 0])

2

Two negative values. Where are they?

In [26]:
rows, cols = np.where(freq_numpy < 0)
print(rows, cols)

[ 0 10] [1 4]


They're in locations [0,1] and [10,4]. Let's see what the values are:

In [20]:
freq_numpy[rows,cols]

array([-1, -5])
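
The checks above can be gathered into a small helper that reports where any negative entries are before the statistic is calculated. The function name `find_negative_cells` and the `demo` array are hypothetical, introduced only for this sketch.

```python
import numpy as np

def find_negative_cells(table):
    """Return the (row, column) index pairs of any negative entries."""
    table = np.asarray(table)
    return [tuple(int(i) for i in idx) for idx in np.argwhere(table < 0)]

# Hypothetical table with two planted negative values
demo = np.array([[3, -1, 2],
                 [4, 0, -5]])

print(find_negative_cells(demo))  # [(0, 1), (1, 2)]
```

An empty list means the table passes the non-negativity requirement, so the helper can be used as a quick guard before calling `get_cramer_v()`.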

### Further resources

PySpark Documentation:
- [`.rand()`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.rand.html)
- [`.ceil()`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.ceil.html)
- [`.crosstab()`](https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.DataFrame.crosstab.html)

Python Documentation:
- [`subprocess`](https://docs.python.org/3/library/subprocess.html) 

sparklyr and tidyverse Documentation:
- [`sdf_coalesce()`](https://spark.rstudio.com/packages/sparklyr/latest/reference/sdf_coalesce.html)
- [`sdf_repartition()`](https://spark.rstudio.com/packages/sparklyr/latest/reference/sdf_repartition.html)
- [`sdf_num_partitions()`](https://spark.rstudio.com/packages/sparklyr/latest/reference/sdf_num_partitions.html)
- [`spark_apply()`](https://spark.rstudio.com/packages/sparklyr/latest/reference/spark_apply.html)

Spark Documentation:
- [Spark Configuration](https://spark.apache.org/docs/latest/configuration.html):
    - [Execution Behaviour](https://spark.apache.org/docs/latest/configuration.html#execution-behavior)
    - [Runtime SQL Configuration](https://spark.apache.org/docs/latest/configuration.html#runtime-sql-configuration)