## Calculating Cramér's V from a Spark DataFrame

### What is Cramér's V?
Cramér's V is a statistical measure of the association between two nominal variables, taking a value between 0 and 1 inclusive. A value of 0 indicates no association and a value of 1 indicates a perfect association between the two variables. It is based on Pearson's chi-squared statistic.

We calculate Cramér's V as follows:

$$ \text{Cramer's V} = \sqrt{\dfrac{\dfrac{\chi^2}{n}}{\min (c-1,r-1)}}, $$ 
where:
- $\chi^2$ is the Chi-squared statistic,
- $n$ is the total number of observations,
- $r$ is the number of rows in the contingency table,
- $c$ is the number of columns in the contingency table.

In some literature you may see the Phi coefficient used ($\phi$), where $\phi^2 = \chi^2/n$.
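
To make the formula concrete, here is a minimal pure-Python sketch that computes $\chi^2$ from the expected counts and then applies the formula above. The function name `cramers_v_from_table` and both small tables are made up for illustration, and the sketch assumes no row or column of the table sums to zero.

```python
import math


def cramers_v_from_table(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Pearson's chi-squared: sum over cells of (observed - expected)^2 / expected,
    # where expected = row total * column total / n
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    r, c = len(table), len(table[0])
    return math.sqrt((chi2 / n) / min(c - 1, r - 1))


perfect = [[10, 0], [0, 10]]        # each row maps to exactly one column
independent = [[10, 20], [20, 40]]  # rows are proportional to each other

print(cramers_v_from_table(perfect))
print(cramers_v_from_table(independent))
```

The first table gives the maximum value of 1 and the second gives 0, matching the two ends of the range described above.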

### Cramér's V in Spark:
Although there is no built-in method for calculating this statistic in base Python, it is reasonably straightforward using the `numpy` and `scipy.stats` packages. An example of this can be found [online here](https://www.statology.org/cramers-v-in-python/).
A similar example for R is linked [here](https://www.statology.org/cramers-v-in-r/).

To calculate Cramér's V, we first need to calculate the $\chi^2$ statistic. In Python we will use `scipy.stats.chi2_contingency`, and in R `chisq.test`. Both functions take a matrix-like contingency table (pair-wise frequency table) as input. Both PySpark and sparklyr have built-in functions that produce these tables (`crosstab`/`sdf_crosstab`), as we will see shortly.

Because PySpark and sparklyr differ from classical Python and R, we need to consider how to calculate Cramér's V when using Spark DataFrames.
First we will import the needed packages, start a Spark session and load the rescue data.

```python
import yaml
import numpy as np
from pyspark.sql import SparkSession, functions as F
import scipy.stats as stats

spark = (SparkSession.builder.master("local[2]")
         .appName("cramer-v")
         .getOrCreate())

with open("../../../config.yaml") as f:
    config = yaml.safe_load(f)

# Alternative data sources defined in the config file:
# rescue_path = config["rescue_path_csv"]
# rescue_path = config["rescue_clean_path"]
# rescue = spark.read.parquet(rescue_path)
rescue_path = "../../data/animal_rescue.csv"
rescue = spark.read.csv(rescue_path, header=True, inferSchema=True)
# Rename the columns used in the examples to snake_case
rescue = (rescue.withColumnRenamed('AnimalGroupParent', 'animal_type')
                .withColumnRenamed('CalYear', 'cal_year')
                .withColumnRenamed('PostcodeDistrict', 'postcode_district'))
```


```R
library(sparklyr)
library(dplyr)

default_config <- sparklyr::spark_config()

sc <- sparklyr::spark_connect(
    master = "local[2]",
    app_name = "cramer-v",
    config = default_config)

config <- yaml::yaml.load_file("ons-spark/config.yaml")
# Parquet files store their own schema, so no header or schema inference options are needed
rescue <- sparklyr::spark_read_parquet(sc, config$rescue_clean_path)
rescue <- rescue %>% dplyr::rename(
                    animal_type = animal_group,
                    postcode_district = postcodedistrict)

```

As Cramér's V is a measure of how two variables are associated, it makes sense to select two variables which we believe either will or will not have some level of association. For our first example we will select the `cal_year` and `animal_type` columns. Following this we will compare `postcode_district` and `animal_type`.

#### Cramér's V Example 1: `cal_year` and `animal_type`

Using either `.crosstab()` or `sdf_crosstab()`, we can calculate a pair-wise frequency table (a.k.a. contingency table) of the `cal_year` and `animal_type` columns. We will generate this table and, in Python, convert it to a pandas DataFrame.

```python
freq_spark = rescue.crosstab('cal_year', 'animal_type')
freq_pandas = freq_spark.toPandas()
freq_pandas.head()
```

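| | cal_year_animal_type | Bird | Budgie | Bull | Cat | Cow | Deer | Dog | Ferret | Fish | ... | Rabbit | Sheep | Snake | Squirrel | Tortoise | Unknown - Animal rescue from water - Farm animal | Unknown - Domestic Animal Or Pet | Unknown - Heavy Livestock Animal | Unknown - Wild Animal | cat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016 | 120 | 0 | 0 | 296 | 1 | 14 | 107 | 0 | 0 | ... | 2 | 1 | 1 | 3 | 0 | 0 | 8 | 0 | 5 | 1 |
| 1 | 2012 | 112 | 0 | 0 | 302 | 3 | 7 | 100 | 1 | 0 | ... | 0 | 1 | 0 | 4 | 0 | 1 | 18 | 4 | 4 | 3 |
| 2 | 2019 | 9 | 0 | 0 | 16 | 0 | 2 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 3 | 0 | 0 | 0 |
| 3 | 2017 | 124 | 0 | 0 | 257 | 0 | 11 | 81 | 0 | 0 | ... | 0 | 0 | 1 | 5 | 0 | 0 | 8 | 0 | 7 | 1 |
| 4 | 2014 | 110 | 0 | 0 | 295 | 1 | 5 | 90 | 0 | 0 | ... | 3 | 0 | 0 | 2 | 0 | 1 | 29 | 0 | 7 | 3 |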


```r
freq_spark <- sdf_crosstab(rescue, 'cal_year', 'animal_type') 
glimpse(freq_spark)
```
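
To make concrete what a pair-wise frequency table contains, the following pure-Python sketch builds one by counting pairs. It does not use Spark at all, and the six records are invented for illustration; it only mirrors the idea behind `.crosstab()`/`sdf_crosstab()`, not their implementation.

```python
from collections import Counter

# Hypothetical (cal_year, animal_type) pairs standing in for rows of the data
records = [
    (2016, "Cat"), (2016, "Cat"), (2016, "Dog"),
    (2017, "Cat"), (2017, "Bird"), (2017, "Bird"),
]

pair_counts = Counter(records)
row_labels = sorted({year for year, _ in records})
col_labels = sorted({animal for _, animal in records})

# One row per distinct first value, one column per distinct second value;
# pairs that never occur get a count of 0
table = [[pair_counts[(r, c)] for c in col_labels] for r in row_labels]

print(col_labels)
for label, row in zip(row_labels, table):
    print(label, row)
```

Each cell holds the number of records with that combination of values, which is exactly the input the $\chi^2$ test expects.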

Now that we have converted our data into a contingency table, we need to be careful when converting it into a matrix or array. If we do this without considering the `cal_year_animal_type` column, we will end up with strings within our matrix. We would spot this issue when moving on to use `chi2_contingency()`, as a `TypeError` would be raised. We therefore need to handle this column so that it is not included when changing our pandas DataFrame into a numpy array. We do this by setting the `cal_year_animal_type` column as the index of our pandas DataFrame: neither the index nor the column headers are carried over into the numpy array, only the internal values are extracted.

A type issue is also raised in R; however, we will deal with this slightly differently. Instead of changing the index of our DataFrame, we will simply drop the `cal_year_animal_type` column. This allows us to pass the frequency table directly into the `chisq.test()` function.

```python
freq_pandas_reindexed = freq_pandas.set_index('cal_year_animal_type')
freq_numpy = np.array(freq_pandas_reindexed)
```


```R
freq_r <- freq_spark %>% collect() 
freq_r <- subset(freq_r, select = -cal_year_animal_type)

```
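
The effect of moving the label column to the index can be checked on a tiny example. This is a sketch using a hand-built pandas DataFrame as a stand-in for the crosstab output, so it needs only `pandas` and `numpy`; the two rows of counts are taken from the `Cat` and `Dog` columns shown earlier, and the variable names `with_labels` and `counts_only` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the crosstab output: one label column plus counts
freq_pandas = pd.DataFrame({
    "cal_year_animal_type": ["2016", "2017"],
    "Cat": [296, 257],
    "Dog": [107, 81],
})

# Converting directly keeps the string labels inside the array
with_labels = np.array(freq_pandas)
print(with_labels.dtype, with_labels.shape)

# Moving the labels to the index leaves only the counts
counts_only = np.array(freq_pandas.set_index("cal_year_animal_type"))
print(counts_only.dtype, counts_only.shape)
```

The first array has three columns and a generic `object` dtype because it mixes strings and integers, whereas the second has two integer columns and is suitable for passing to `chi2_contingency()`.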

As we want to calculate Cramér's V for two examples, we will define a function to perform the $\chi^2$ test and calculate the statistic. We only need the $\chi^2$ statistic, so we extract it from the output of `chi2_contingency`/`chisq.test` using `[0]` in Python and `$statistic` in R. The `chisq.test` output includes additional text; to remove it we cast the output to a double, keeping only the numerical value.

```python
def get_cramer_v(freq_numpy):
    # Chi-squared test statistic, sample size, and the smaller of (rows, columns) minus 1
    X2 = stats.chi2_contingency(freq_numpy, correction=False)[0]
    n = np.sum(freq_numpy)
    minDim = min(freq_numpy.shape) - 1

    # Calculate Cramér's V
    V = np.sqrt((X2 / n) / minDim)
    return V
```

```r
get_cramer_v <- function(freq_r){
  # Chi-squared test statistic, sample size, and minimum of rows and columns
  X2 <- chisq.test(freq_r, correct = FALSE)$statistic
  X2 <- as.double(X2)
  n <- sum(freq_r)
  minDim <- min(dim(freq_r)) - 1

  # calculate Cramer's V 
  V <- sqrt((X2/n) / minDim)
  return(V)
}
```
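
Both functions above switch off the continuity correction (`correction=False` / `correct = FALSE`). By default, `chi2_contingency` applies Yates' continuity correction when the table has one degree of freedom (a 2x2 table), which changes the $\chi^2$ statistic and therefore Cramér's V. The sketch below illustrates this on two small made-up tables; the variable names are illustrative only.

```python
import numpy as np
import scipy.stats as stats

# A 2x2 table has one degree of freedom, so the default correction applies
two_by_two = np.array([[10, 0], [0, 10]])
uncorrected = stats.chi2_contingency(two_by_two, correction=False)[0]
corrected = stats.chi2_contingency(two_by_two, correction=True)[0]
print(uncorrected, corrected)

# With more than one degree of freedom the correction is not applied
wide = np.array([[10, 0, 5], [0, 10, 5]])
wide_uncorrected = stats.chi2_contingency(wide, correction=False)[0]
wide_default = stats.chi2_contingency(wide, correction=True)[0]
print(wide_uncorrected, wide_default)
```

For the 2x2 table the uncorrected statistic equals $n = 20$, giving a Cramér's V of 1, while the corrected statistic is smaller (about 16.2) and would give a value below 1. For the wider table the two statistics agree.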

Having handled the label column, we can now apply the `get_cramer_v()` function to our frequency table.

```python
get_cramer_v(freq_numpy)
```

0.08756854441378689

```r
get_cramer_v(freq_r)
```

We get a Cramér's V of approximately 0.088, telling us that there is little to no association between `cal_year` and `animal_type`.
This example also validates both Python and R methods against each other, as we get identical outputs.

#### Cramér's V Example 2: `postcode_district` and `animal_type`

Using the function defined earlier, we just need to repeat the cross-tabulation before applying it. This time we will select `postcode_district` instead of `cal_year`.

```python
freq_spark = rescue.crosstab('postcode_district', 'animal_type')
freq_pandas = freq_spark.toPandas()
freq_pandas_reindexed = freq_pandas.set_index('postcode_district_animal_type')
freq_numpy = np.array(freq_pandas_reindexed)
get_cramer_v(freq_numpy)
```

0.28791093176153515

```r
freq_spark <- sdf_crosstab(rescue, 'postcode_district', 'animal_type') 
freq_r <- freq_spark %>% collect() 
freq_r <- subset(freq_r, select = -postcode_district_animal_type)
get_cramer_v(freq_r)

```

This time we get a Cramér's V of approximately 0.29, suggesting a slight association between `postcode_district` and `animal_type`.

### Potential Issue with `sdf_crosstab()`
During testing for this page, we noted an issue which may arise when attempting to calculate Cramér's V. Specifically, the issue is linked to the `sdf_crosstab()` function when applied to data which has not been fully preprocessed.

For this example we take the unprocessed rescue dataset. Here, `animal_type` contains values for cat with both an upper and a lower case first letter. We will use the `tryCatch()` function to handle the error message; this is similar to Python's `try`/`except` statements (for more information see [`Error handling in R`](https://cran.r-project.org/web/packages/tryCatchLog/vignettes/tryCatchLog-intro.html) or [`Errors and Exceptions (Python)`](https://docs.python.org/3/tutorial/errors.html)).



```r
rescue_raw <- sparklyr::spark_read_csv(sc, config$rescue_path_csv, header=TRUE, infer_schema=TRUE)
rescue_raw <- rescue_raw %>%
    dplyr::rename(
        incident_number = IncidentNumber,
        animal_type = AnimalGroupParent,
        cal_year = CalYear)

# Calling sdf_crosstab() directly on this data stops with an error,
# so the call is wrapped in tryCatch() to capture and print the message

tryCatch(
  {
    sdf_crosstab(rescue_raw, 'cal_year', 'animal_type')
  },
  error = function(e){
    message('Error message from Spark')
    print(e)
  }
)
```

From the error message, we can see that `Cat` is ambiguous. When we look closer at the distinct values within the `animal_type` column, we see that both `cat` and `Cat` are present. This is something to be aware of if you wish to use `sdf_crosstab()` in the future.

```r
rescue_raw %>% sparklyr::select(animal_type) %>% distinct()
```

PySpark's `.crosstab()` does not encounter this issue; it treats `cat` and `Cat` as separate groups, as seen in the separate `Cat` and `cat` columns of the frequency table in Example 1.
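
Whichever language is used, it can be worth checking for labels that differ only by case before cross-tabulating. The following is a minimal pure-Python sketch; the helper name `find_case_collisions` and the list of values are hypothetical, standing in for the distinct values collected from the `animal_type` column.

```python
from collections import defaultdict


def find_case_collisions(labels):
    """Group labels that are identical once letter case is ignored."""
    groups = defaultdict(set)
    for label in labels:
        groups[label.lower()].add(label)
    # Keep only groups containing more than one distinct spelling
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


# Hypothetical distinct values, as might be collected from `animal_type`
animal_types = ["Cat", "cat", "Dog", "Bird"]
print(find_case_collisions(animal_types))
```

An empty result means no two labels clash on case; otherwise the clashing spellings can be harmonised before building the frequency table.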

### Further resources

PySpark Documentation:
- [`.crosstab()`](https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.DataFrame.crosstab.html)

Python Documentation:
- [`chi2_contingency()`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html)
- [`array()`](https://numpy.org/doc/stable/reference/generated/numpy.array.html)
- [`Errors and Exceptions`](https://docs.python.org/3/tutorial/errors.html)

sparklyr Documentation:
- [`sdf_crosstab()`](https://www.rdocumentation.org/packages/sparklyr/versions/1.3.1/topics/sdf_crosstab)

R Documentation:
- [`chisq.test()`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/chisq.test)
- [`Error handling in R`](https://cran.r-project.org/web/packages/tryCatchLog/vignettes/tryCatchLog-intro.html)
