## Calculating Cramér's V from a Spark DataFrame

### What is Cramér's V?
Cramér's V is a statistical measure of association between two nominal variables, giving a value between 0 and 1 inclusive. A value of 0 indicates no association and a value of 1 indicates a perfect association between the two variables. It is based on Pearson's chi-squared statistic.

We calculate Cramér's V as follows:

$$ \text{Cramer's V} = \sqrt{\dfrac{\dfrac{\chi^2}{n}}{\min (c-1,r-1)}}, $$ 
where:
- $\chi^2$ is Pearson's chi-squared statistic,
- $n$ is the total number of observations,
- $r$ is the number of rows in the contingency table,
- $c$ is the number of columns in the contingency table.

In some literature you may see the Phi coefficient used ($\phi$), where $\phi^2 = \chi^2/n$.
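
To make the formula concrete, here is a minimal sketch in plain Python that computes $\chi^2$ from observed and expected counts and then applies the definition above. The function name `cramers_v` and the two small tables are hypothetical, chosen only to illustrate the two ends of the scale, and the sketch assumes no row or column total is zero.

```python
import math

def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts.

    Hypothetical helper for illustration; assumes every row and column
    total is greater than zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Pearson's chi-squared: sum of (observed - expected)^2 / expected
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    min_dim = min(len(table) - 1, len(table[0]) - 1)
    return math.sqrt((chi2 / n) / min_dim)

# Rows are proportional to each other, so there is no association
independent = [[10, 20], [20, 40]]
# Each row category maps to exactly one column category
perfect = [[5, 0], [0, 5]]

print(cramers_v(independent))
print(cramers_v(perfect))
```

The first table gives a value of 0 and the second gives a value of 1, matching the interpretation of the two extremes.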

### Cramér's V in Spark:
Although there is no built-in method for calculating this statistic in base Python, it is reasonably straightforward using the `numpy` and `scipy.stats` packages. An example of this can be found [online here](https://www.statology.org/cramers-v-in-python/).
A similar example for R is linked [here](https://www.statology.org/cramers-v-in-r/).

Because PySpark and sparklyr differ from standard Python and R, we need to consider how to calculate Cramér's V when using Spark DataFrames.
First we will import the needed packages and start a Spark session.

In [1]:
import numpy as np
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import scipy.stats as stats

spark = (SparkSession.builder.master("local[2]")
         .appName("cramer-v")
         .getOrCreate())

```R
library(sparklyr)
library(dplyr)

default_config <- sparklyr::spark_config()

sc <- sparklyr::spark_connect(
    master = "local[2]",
    app_name = "cramer-v",
    config = default_config)
```

For this example we will create some dummy data using the `F.rand()` spark function which will generate random numbers. For repeatable results we will set a seed.
The `F.ceil()` function will round the number up to the nearest integer. 
After creating the dummy data, we will show the first 5 rows.


In [44]:
# Setting random number seed
seed_no = 42

# Creating spark dataframe
df = spark.range(100)
df = df.union(df)
df = df.withColumn("value", F.ceil(F.rand(seed_no) * 10))
df.show(5)

+---+-----+
| id|value|
+---+-----+
|  0|    7|
|  1|    6|
|  2|    9|
|  3|    3|
|  4|    7|
+---+-----+
only showing top 5 rows
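
As a rough local analogue of the `F.ceil(F.rand(seed_no) * 10)` expression, the sketch below uses Python's standard `random` and `math` modules rather than Spark. It only illustrates the idea of scaling a uniform random number and rounding up. The generated values will not match the Spark output, because the two random number generators are unrelated.

```python
import math
import random

seed_no = 42
rng = random.Random(seed_no)  # seeded generator, so results are repeatable

# random() returns a float in [0, 1); multiplying by 10 and rounding up
# gives an integer from 1 to 10 (0 only if random() returns exactly 0.0)
values = [math.ceil(rng.random() * 10) for _ in range(200)]

print(min(values), max(values))
```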

```R
seed_no <- 42L
df <- sparklyr::sdf_seq(sc, from = 1, to = 100)
df <- sdf_bind_rows(df, df)
df <- df %>% sparklyr::mutate(value = ceil(rand(seed_no) * 10)) %>%
        sparklyr::mutate(id = double(id))
df %>% head(5) %>% print()
```

Using the `.crosstab()` function, we can calculate a pair-wise frequency table of the `id` and `value` columns (a.k.a. contingency table). We will generate this table and convert it to a pandas DataFrame.

In [3]:
freq_spark = df.crosstab('id','value')
freq_pandas = freq_spark.toPandas()
freq_pandas.head()

  id_value  -1  -2  -3  -4  -5  -6  -7  -8  -9  0
0        7   0   0   0   0   0   0   0   1   0  1
1       51   0   1   0   0   0   1   0   0   0  0
2       15   0   0   0   1   0   0   1   0   0  0
3       54   0   0   1   0   0   1   0   0   0  0
4       11   0   0   0   0   0   1   0   0   1  0
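
To make the structure of a pair-wise frequency table concrete, here is a minimal plain-Python sketch of the counting that a crosstab performs. The helper name `crosstab` and the sample pairs are hypothetical and this is not how Spark implements `.crosstab()`. It is only meant to show that the row labels are a separate piece of information from the counts.

```python
from collections import Counter

def crosstab(pairs):
    """Count how often each (row_value, column_value) pair occurs.

    Hypothetical helper: returns the row labels, the column labels and
    the table of counts as three separate objects.
    """
    counts = Counter(pairs)
    row_labels = sorted({row for row, _ in pairs})
    col_labels = sorted({col for _, col in pairs})
    # A Counter returns 0 for pairs that never occur
    table = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table

# Made-up (id, value) pairs for illustration
pairs = [(0, 7), (1, 6), (2, 9), (0, 3), (1, 6), (2, 1)]
row_labels, col_labels, table = crosstab(pairs)

print(row_labels)
print(col_labels)
for row in table:
    print(row)
```

Only `table` holds frequencies. In the Spark result the row labels are returned as the first column (`id_value` here), so they need to be kept out of the counts before any statistic is calculated.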


```R
freq_spark <- sdf_crosstab(df, 'id', 'value')
freq_spark %>% head(5) %>% print()

freq_r <- freq_spark %>% collect()
```

As we have brought our pair-wise frequency table into our local environment, we can now utilise the `scipy.stats.chi2_contingency()` function. From the [documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html), we can see that this takes an `array_like` input, as such we should convert our table into a Numpy array:

In [4]:
freq_numpy = np.array(freq_pandas)

The example Python code for this has been adapted from the article linked previously. We have opted to restructure it into a function so that it can be run several times. After defining the function, we will attempt to run it on our NumPy array `freq_numpy`. We have wrapped the call in a `try`-`except` statement to capture the error without stopping the full script from running.

In [48]:
def get_cramer_v(freq_numpy):
    #Chi-squared test statistic, sample size, and minimum of rows and columns
    X2 = stats.chi2_contingency(freq_numpy, correction=False)[0]
    n = np.sum(freq_numpy)
    minDim = min(freq_numpy.shape)-1

    #calculate Cramer's V 
    V = np.sqrt((X2/n) / minDim)
    return V

try:
    get_cramer_v(freq_numpy)
except TypeError as e:
    print(e)

This `TypeError` says we have strings where we need integers. We should return to our spark pair-wise frequency table to double check the data types.

In [6]:
freq_spark.printSchema()

root
 |-- id_value: string (nullable = false)
 |-- -1: long (nullable = false)
 |-- -2: long (nullable = false)
 |-- -3: long (nullable = false)
 |-- -4: long (nullable = false)
 |-- -5: long (nullable = false)
 |-- -6: long (nullable = false)
 |-- -7: long (nullable = false)
 |-- -8: long (nullable = false)
 |-- -9: long (nullable = false)
 |-- 0: long (nullable = false)



It looks like the `id_value` column is of type string, so the values of the first variable passed to `.crosstab()` come out as strings. Let's fix that using the `.cast()` method.

In [7]:
from pyspark.sql.types import IntegerType
freq_spark = freq_spark.withColumn("id_value", F.col("id_value").cast(IntegerType()))
freq_spark.printSchema()

root
 |-- id_value: integer (nullable = true)
 |-- -1: long (nullable = false)
 |-- -2: long (nullable = false)
 |-- -3: long (nullable = false)
 |-- -4: long (nullable = false)
 |-- -5: long (nullable = false)
 |-- -6: long (nullable = false)
 |-- -7: long (nullable = false)
 |-- -8: long (nullable = false)
 |-- -9: long (nullable = false)
 |-- 0: long (nullable = false)



Now that `id_value` is an `int`, we can carry on with the rest of the processing.

In [50]:
freq_pandas = freq_spark.toPandas()
freq_numpy = np.array(freq_pandas)
print(get_cramer_v(freq_numpy))


0.1935164295375403


In [12]:
X2 = stats.chi2_contingency(freq_numpy, correction=False)[0]
n = np.sum(freq_numpy)
minDim = min(freq_numpy.shape)-1
print(X2,n,minDim)
print((freq_spark.count(), len(freq_spark.columns)))

1928.603337799327 5150 10
(100, 11)


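This time we get a result of about 0.19, which would suggest a weak association between `id` and `value`. Note, however, that the diagnostic output gives `n` = 5150 although `df` has only 200 rows. Once `id_value` is cast to an integer, `np.array(freq_pandas)` treats the identifiers 0 to 99 (which sum to 4950) as an extra column of counts. Setting `id_value` as the index before converting, for example with `freq_pandas.set_index('id_value')`, keeps the identifiers out of the contingency table.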
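
As a sanity check on the numbers printed above, the sketch below plugs the reported $\chi^2$, `n` and `minDim` into the formula using only the standard library, and then compares `n` with the number of rows in `df`. The variable names are chosen for illustration. The reading that the identifier column was counted as frequencies is an inference from the arithmetic, not something the output states directly.

```python
import math

# Values copied from the diagnostic output above
chi2, n, min_dim = 1928.603337799327, 5150, 10

# Applying the Cramér's V formula reproduces the reported statistic (about 0.19)
v = math.sqrt((chi2 / n) / min_dim)
print(v)

# df has 100 ids, each appearing twice, so a table of counts should sum to 200
n_rows = 200
id_sum = sum(range(100))  # the ids 0..99 add up to 4950

# 4950 + 200 equals the reported n, which suggests the integer id column
# was included in the array alongside the counts
print(id_sum + n_rows)
```

If the identifiers are excluded, the table has 100 rows and 10 columns, so `n` would be 200 and `minDim` would be 9 rather than 10.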

#### Errors with negative values
We also note that all frequencies must be non-negative. If there is a negative value in the frequency table, we will get an error.

To demonstrate, let's plant some negative values in `freq_numpy`, then calculate the statistic again.

In [None]:
freq_numpy[0][1] = -1
freq_numpy[10, 4] = -5

try:
    get_cramer_v(freq_numpy)
except ValueError as e:
    print(e)

This `ValueError` tells us we have negative values. Suppose we did not know how many negative values there were or where they were located. We can index NumPy arrays in a similar way to pandas DataFrames to find out.

In [None]:
# How many negative values
print(' Number of negative values:', len(freq_numpy[freq_numpy < 0]))

# Where are they located?
rows, cols = np.where(freq_numpy < 0)
print(' row of negative value(s):', rows,
      '\n column of negative value(s):', cols)



We can see that there are two negative values (shock!) and they are located at [0,1] and [10,4]. Again, we can index our NumPy array to extract the values at these two points:

In [None]:
freq_numpy[rows,cols]
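
The same check can be sketched without NumPy, which may be handy for validating a table before passing it to `get_cramer_v()`. The helper name `find_negative_cells` and the 11 by 11 table of zeros are hypothetical. A NumPy array could be converted to nested lists with `.tolist()` before using a helper like this.

```python
def find_negative_cells(table):
    """Return (row, column) positions of all negative entries in a nested list."""
    return [
        (i, j)
        for i, row in enumerate(table)
        for j, value in enumerate(row)
        if value < 0
    ]

# Hypothetical table of zeros with the same two planted negatives as above
table = [[0] * 11 for _ in range(11)]
table[0][1] = -1
table[10][4] = -5

positions = find_negative_cells(table)
negative_values = [table[i][j] for i, j in positions]

print(positions)
print(negative_values)
```

An empty list from `find_negative_cells()` would mean the table passes the non-negativity requirement.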

### Further resources

PySpark Documentation:
- [`.rand()`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.rand.html)
- [`.ceil()`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.ceil.html)
- [`.crosstab()`](https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.DataFrame.crosstab.html)

Python Documentation:
- [`subprocess`](https://docs.python.org/3/library/subprocess.html) 

sparklyr and tidyverse Documentation:
- [`sdf_coalesce()`](https://spark.rstudio.com/packages/sparklyr/latest/reference/sdf_coalesce.html)
- [`sdf_repartition()`](https://spark.rstudio.com/packages/sparklyr/latest/reference/sdf_repartition.html)
- [`sdf_num_partitions()`](https://spark.rstudio.com/packages/sparklyr/latest/reference/sdf_num_partitions.html)
- [`spark_apply()`](https://spark.rstudio.com/packages/sparklyr/latest/reference/spark_apply.html)

Spark Documentation:
- [Spark Configuration](https://spark.apache.org/docs/latest/configuration.html):
    - [Execution Behaviour](https://spark.apache.org/docs/latest/configuration.html#execution-behavior)
    - [Runtime SQL Configuration](https://spark.apache.org/docs/latest/configuration.html#runtime-sql-configuration)

In [54]:
df = spark.range(100)
df = df.union(df)
df = df.withColumn("value", F.ceil(F.rand(seed_no) * 10))
freq_spark = df.crosstab('id','value')
freq_spark = freq_spark.withColumn("id_value", F.col("id_value").cast(IntegerType()))
freq_pandas = freq_spark.toPandas()
freq_numpy = np.array(freq_pandas)

X2 = stats.chi2_contingency(freq_numpy, correction=False)[0]
n = np.sum(freq_numpy)
minDim = min(freq_numpy.shape)-1
print(X2,n,minDim)
freq_pandas_index = freq_pandas.set_index('id_value')
freq_numpy_index = np.array(freq_pandas_index)
print(stats.chi2_contingency(freq_numpy_index, correction=False)[3])
print(get_cramer_v(freq_numpy_index))

1928.603337799327 5150 10
[[0.19 0.13 0.15 0.21 0.15 0.16 0.2  0.22 0.27 0.32]
 [0.19 0.13 0.15 0.21 0.15 0.16 0.2  0.22 0.27 0.32]
 [0.19 0.13 0.15 0.21 0.15 0.16 0.2  0.22 0.27 0.32]
 [0.19 0.13 0.15 0.21 0.15 0.16 0.2  0.22 0.27 0.32]
 [0.19 0.13 0.15 0.21 0.15 0.16 0.2  0.22 0.27 0.32]
 [0.19 0.13 0.15 0.21 0.15 0.16 0.2  0.22 0.27 0.32]
 [0.19 0.13 0.15 0.21 0.15 0.16 0.2  0.22 0.27 0.32]
 [0.19 0.13 0.15 0.21 0.15 0.16 0.2  0.22 0.27 0.32]
 [0.19 0.13 0.15 0.21 0.15 0.16 0.2  0.22 0.27 0.32]
 [0.19 0.13 0.15 0.21 0.15 0.16 0.2  0.22 0.27 0.32]
 [0.19 0.13 0.15 0.21 0.15 0.16 0.2  0.22 0.27 0.32]
 [0.19 0.13 0.15 0.21 0.15 0.16 0.2  0.22 0.27 0.32]
 [0.19 0.13 0.15 0.21 0.15 0.16 0.2  0.22 0.27 0.32]
 [0.19 0.13 0.15 0.21 0.15 0.16 0.2  0.22 0.27 0.32]
 [0.19 0.13 0.15 0.21 0.15 0.16 0.2  0.22 0.27 0.32]
 [0.19 0.13 0.15 0.21 0.15 0.16 0.2  0.22 0.27 0.32]
 [0.19 0.13 0.15 0.21 0.15 0.16 0.2  0.22 0.27 0.32]
 [0.19 0.13 0.15 0.21 0.15 0.16 0.2  0.22 0.27 0.32]
 [0.19 0.13 0.15 0.2

In [23]:
df_pd = df.toPandas()
import pandas as pd
freq = pd.crosstab(df_pd['id'], df_pd['value'])
np.array(freq)


array([[0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 2],
       [0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 1, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0,

In [36]:
#load necessary packages and functions
import scipy.stats as stats
import numpy as np

#create 2x2 table
data = np.array([[7,12], [9,8]])

#Chi-squared test statistic, sample size, and minimum of rows and columns
X2 = stats.chi2_contingency(data, correction=False)[0]
n = np.sum(data)
minDim = min(data.shape)-1

#calculate Cramer's V 
V = np.sqrt((X2/n) / minDim)

#display Cramer's V
print(V)
stats.chi2_contingency(data, correction=False)

0.16174359558286786


(0.9417956656346751,
 0.33181646987425817,
 1,
 array([[ 8.44444444, 10.55555556],
        [ 7.55555556,  9.44444444]]))
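
For a 2x2 table, $\min(c-1, r-1) = 1$, so Cramér's V reduces to $\sqrt{\chi^2/n}$, which is the absolute value of the Phi coefficient mentioned earlier. The sketch below checks this for the table above using the standard closed-form expression for $\phi$ on a 2x2 table. The variable names `a`, `b`, `c` and `d` are labels for the four cells, introduced here for illustration.

```python
import math

# Cells of the 2x2 table [[7, 12], [9, 8]]
a, b = 7, 12
c, d = 9, 8

# Phi for a 2x2 table: (ad - bc) / sqrt(product of the row and column totals)
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# The sign of phi depends on how the categories are ordered;
# its absolute value equals Cramér's V for a 2x2 table
print(phi)
print(abs(phi))
```

The absolute value agrees with the Cramér's V of about 0.1617 printed above.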

In [59]:
import yaml
with open("../../../config.yaml") as f:
    config = yaml.safe_load(f)
    
# rescue_path = config["rescue_path_csv"]
rescue_path = "../../data/animal_rescue.csv"
rescue = spark.read.csv(rescue_path, header=True, inferSchema=True)
rescue = rescue.withColumnRenamed('AnimalGroupParent','animal_type')

In [73]:
rescue_crosstab = rescue.crosstab('CalYear','animal_type')

rescue_numpy = np.array(rescue_crosstab.toPandas().set_index('CalYear_animal_type'))
get_cramer_v(rescue_numpy)
X2 = stats.chi2_contingency(rescue_numpy, correction=False)[0]
n = np.sum(rescue_numpy)
minDim = min(rescue_numpy.shape)-1
print(X2,n,minDim)
expected = stats.chi2_contingency(rescue_numpy, correction=False)[3]


452.2733832747976 5898 10
[[-7.35164463e+00  2.04815192e-01  1.02407596e-01  1.90369617e+00
  -2.83146829e-01 -4.37368600e+00 -3.77314344e+00  5.12037979e-01
   2.04815192e-01 -4.62699220e+00  1.02407596e-01 -2.56629366e+00
   1.02407596e-01  4.28280773e+00  2.04815192e-01  3.07222787e-01
   4.09630383e-01 -9.75924042e-01 -5.90369617e-01 -1.80739234e-01
   1.60834181e+00  1.02407596e-01  3.07222787e-01  8.59003052e+00
   4.60834181e+00  6.32417769e-01  5.36113937e-01]
 [ 4.61851475e-01  2.04476094e-01  1.02238047e-01 -4.58952187e+00
  -2.28433367e+00  2.61037640e+00  3.05595117e+00 -4.88809766e-01
   2.04476094e-01  1.03326551e+01  1.02238047e-01  1.43133266e+00
   1.02238047e-01 -1.17441506e+01 -7.95523906e-01  3.06714140e-01
   4.08952187e-01  1.02238047e+00 -5.91047813e-01  8.17904374e-01
   6.00712106e-01  1.02238047e-01 -6.93285860e-01 -1.43743642e+00
   6.00712106e-01  1.62309257e+00 -1.46642930e+00]
 [-2.28585961e+00  1.22075280e-02  6.10376399e-03  1.75584944e+00
   4.27263479e

In [75]:
rescue_crosstab.show()
rescue.printSchema()

+-------------------+----+------+----+---+---+----+---+------+----+---+----+-------+--------+-----+----+------+------+------+-----+-----+--------+--------+------------------------------------------------+--------------------------------+--------------------------------+---------------------+---+
|CalYear_animal_type|Bird|Budgie|Bull|Cat|Cow|Deer|Dog|Ferret|Fish|Fox|Goat|Hamster|Hedgehog|Horse|Lamb|Lizard|Pigeon|Rabbit|Sheep|Snake|Squirrel|Tortoise|Unknown - Animal rescue from water - Farm animal|Unknown - Domestic Animal Or Pet|Unknown - Heavy Livestock Animal|Unknown - Wild Animal|cat|
+-------------------+----+------+----+---+---+----+---+------+----+---+----+-------+--------+-----+----+------+------+------+-----+-----+--------+--------+------------------------------------------------+--------------------------------+--------------------------------+---------------------+---+
|               2016| 120|     0|   0|296|  1|  14|107|     0|   0| 29|   0|      4|       0|   12|   0|     

root
 |-- IncidentNumber: string (nullable = true)
 |-- DateTimeOfCall: string (nullable = true)
 |-- CalYear: integer (nullable = true)
 |-- FinYear: string (nullable = true)
 |-- TypeOfIncident: string (nullable = true)
 |-- PumpCount: double (nullable = true)
 |-- PumpHoursTotal: double (nullable = true)
 |-- HourlyNotionalCost(£): integer (nullable = true)
 |-- IncidentNotionalCost(£): double (nullable = true)
 |-- FinalDescription: string (nullable = true)
 |-- animal_type: string (nullable = true)
 |-- OriginofCall: string (nullable = true)
 |-- PropertyType: string (nullable = true)
 |-- PropertyCategory: string (nullable = true)
 |-- SpecialServiceTypeCategory: string (nullable = true)
 |-- SpecialServiceType: string (nullable = true)
 |-- WardCode: string (nullable = true)
 |-- Ward: string (nullable = true)
 |-- BoroughCode: string (nullable = true)
 |-- Borough: string (nullable = true)
 |-- StnGroundName: string (nullable = true)
 |-- PostcodeDistrict: string (nullable = true)
 |-- Easting_m: double (nullable = true)
 |-- Northing_m: double (nullable = true)
 |-- Easting_rounded: integer (nullable = true)
 |-- Northing_rounded: integer (nullable = true)

In [76]:
rescue_crosstab = rescue.crosstab('PostcodeDistrict','animal_type')

rescue_numpy = np.array(rescue_crosstab.toPandas().set_index('PostcodeDistrict_animal_type'))
get_cramer_v(rescue_numpy)

0.28791093176153515