`cna` R package exemplification.
From *Identifying Complex Causal Dependencies in Configurational Data with Coincidence Analysis* (Baumgartner & Thiem, 2015).

Symbols:

- Conjunction: `*` or `&`
- Disjunction: `+` or `|`
- Negation: `-` or `!` or, in case of crisp-set or fuzzy-set data, by changing upper case into lower case letters and vice versa
- Implication: `->`
- Equivalence: `<->`

In [1]:
library(cna)

Registered S3 method overwritten by 'cna':
  method          from
  some.data.frame car 



## `d.educate` optimal hypothetical dataset

Load the example dataset `d.educate` from `cna`:

In [2]:
data(d.educate)
print(d.educate)

  U D L G E
A 1 1 1 1 1
B 1 1 1 0 1
C 1 0 1 1 1
D 1 0 1 0 1
E 0 1 1 1 1
F 0 1 1 0 1
G 0 0 0 1 1
H 0 0 0 0 0


The heart of the `cna` package is the `cna()` function.
It identifies and minimizes dependencies of sufficiency and necessity in the data.
The data passed to `cna()` can be in the form of a Boolean data frame or a truth table (as produced by the `truthTab()` function).

`truthTab()` merges multiple rows of a data frame featuring the same configuration into one row, such that each row of the resulting truth table corresponds to one determinate configuration.
The number of occurrences (cases) and an enumeration of the cases are saved as attributes `n` and `cases`, respectively.
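
The row-merging idea can be illustrated with a minimal base-R sketch. It does not call `truthTab()` and is not how `cna` implements it; the data frame `toy` and its case names are hypothetical, chosen only so that some configurations repeat.

```r
# Hypothetical toy data: five cases, two configurations occur twice.
toy <- data.frame(
  A = c(1, 1, 0, 0, 1),
  B = c(0, 0, 1, 1, 1),
  row.names = c("c1", "c2", "c3", "c4", "c5")
)

# One string key per row that identifies its configuration, e.g. "10".
key   <- do.call(paste, c(toy, sep = ""))
first <- !duplicated(key)   # first occurrence of each configuration
ord   <- key[first]         # configurations in order of appearance

# Keep one row per configuration, then record frequency and member cases.
tt <- toy[first, , drop = FALSE]
tt$n     <- as.vector(table(key)[ord])
tt$cases <- as.vector(tapply(rownames(toy), key, paste, collapse = ",")[ord])
rownames(tt) <- NULL
print(tt)
```

Each row of `tt` now stands for one distinct configuration, with `n` counting its occurrences and `cases` listing the cases that instantiate it.
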
As `d.educate` (Table 1 in the paper) does not contain multiple rows with identical configurations, applying `truthTab()` is unnecessary and we can pass `d.educate` directly to `cna()`.
Moreover, let us assume that we have no prior causal knowledge about the underlying causal structure, such that we cannot additionally supply a causal ordering.
The following is the default output returned by `cna()`:

In [7]:
cna(d.educate)

--- Coincidence Analysis (CNA) ---

Factors: U, D, L, G, E 

Atomic solution formulas:
-------------------------
Outcome E:
 solution        consistency coverage complexity inus
 L + G <-> E               1        1          2 TRUE
 U + D + G <-> E           1        1          3 TRUE

Outcome L:
 solution    consistency coverage complexity inus
 U + D <-> L           1        1          2 TRUE

Complex solution formulas:
--------------------------
 outcome solution                        consistency coverage complexity inus
 E,L     (L + G <-> E)*(U + D <-> L)               1        1          4 TRUE
 E,L     (U + D + G <-> E)*(U + D <-> L)           1        1          5 TRUE

Consistency and coverage scores reach their maximal values for both the atomic and the complex solution formulas; hence the `d.educate` data are as good as configurational data can possibly get.
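
The scores of the atomic formulas can be checked by hand. The sketch below rebuilds `d.educate` from the printed table so that it runs without the package, and uses a hypothetical helper `fit()` (not part of `cna`) that applies the usual crisp-set definitions: consistency is the share of cases instantiating the condition that also instantiate the outcome, and coverage is the share of cases instantiating the outcome that also instantiate the condition.

```r
# d.educate as printed above, transcribed by hand.
d <- data.frame(
  U = c(1, 1, 1, 1, 0, 0, 0, 0),
  D = c(1, 1, 0, 0, 1, 1, 0, 0),
  L = c(1, 1, 1, 1, 1, 1, 0, 0),
  G = c(1, 0, 1, 0, 1, 0, 1, 0),
  E = c(1, 1, 1, 1, 1, 1, 1, 0),
  row.names = LETTERS[1:8]
)

# Hypothetical helper (not a cna function): crisp-set consistency and coverage.
fit <- function(cond, outcome) {
  cond    <- as.logical(cond)
  outcome <- as.logical(outcome)
  c(consistency = sum(cond & outcome) / sum(cond),
    coverage    = sum(cond & outcome) / sum(outcome))
}

fit_LG  <- with(d, fit(L | G, E))       # L + G <-> E
fit_UDG <- with(d, fit(U | D | G, E))   # U + D + G <-> E
fit_UD  <- with(d, fit(U | D, L))       # U + D <-> L
print(rbind(fit_LG, fit_UDG, fit_UD))
```

All three rows come out with consistency 1 and coverage 1, matching the `cna()` output.
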

Note that, according to the atomic solution formulas, there are *two* endogenous factors, $L$ and $E$; $E$ has two possible solutions, which account for the two complex solutions.
Accordingly, `cna()` infers that `d.educate` can be modeled by either of the two complex structures given in the complex solution formulas (each of which can be drawn as a causal graph):

- $(L + G \leftrightarrow E)*(U + D \leftrightarrow L)$ represents a *causal chain*
- $(U + D + G \leftrightarrow E)*(U + D \leftrightarrow L)$ represents a *common cause* structure

As `d.educate` is optimal by all standards of configurational modeling, *there is no way of determining which of these two structures is the true or correct one*.
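
One way to see why the data cannot separate the two structures is to compare the two candidate conditions for $E$ case by case. The following base-R sketch, again using a hand-transcribed copy of `d.educate`, is only an illustration of that point and not part of the original analysis.

```r
# d.educate as printed earlier, transcribed by hand.
d <- data.frame(
  U = c(1, 1, 1, 1, 0, 0, 0, 0),
  D = c(1, 1, 0, 0, 1, 1, 0, 0),
  L = c(1, 1, 1, 1, 1, 1, 0, 0),
  G = c(1, 0, 1, 0, 1, 0, 1, 0),
  E = c(1, 1, 1, 1, 1, 1, 1, 0)
)

# L takes the same value as U + D in every case ...
same_L <- all((d$U | d$D) == as.logical(d$L))
# ... so L + G and U + D + G are indistinguishable as conditions for E.
same_E_cond <- all((d$L | d$G) == (d$U | d$D | d$G))
print(c(same_L = same_L, same_E_cond = same_E_cond))
```

Because $L$ and $U + D$ coincide in all eight cases, substituting one for the other in the formula for $E$ changes nothing in the data, which is why both structures fit perfectly.
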


## `d.irrigate` real-world dataset

`d.irrigate`, included in `cna`, comes from a study by Lam and Ostrom (2010), who analyze the effects of an irrigation experiment in the course of development interventions on the Indrawati River watershed in the central hills of Nepal.
They investigate the causal relevance of five exogenous factors on “persistent improvement in water adequacy at the tail end in winter” ($W$), which takes the value 1 when farmers at the tail end of the watershed persistently receive the water they need in winter, and the value 0 otherwise.

The five exogenous factors are (for all of these, the values 1 and 0 represent “yes” and “no”, respectively):

- ($A$) “continual assistance on infrastructure improvement”
- ($R$) “existence of a set of formal rules for irrigation operation and maintenance”
- ($F$) “existence of provisions of fines”
- ($L$) “existence of consistent leadership”
- ($C$) “existence of collective action among farmers for system maintenance” 


In [2]:
data(d.irrigate)
d.irrigate

   A R F L C W
1  0 1 0 1 1 1
2  0 1 0 1 1 0
3  0 1 1 1 1 1
4  0 1 1 1 1 1
5  1 1 0 1 1 1
6  1 1 0 1 1 1
7  1 1 1 1 1 1
8  1 1 1 1 1 1
9  0 0 0 0 1 1
10 0 1 0 0 1 1


The authors assume that $W$ is the ultimate outcome of a causal structure, and this assumption can be passed to `cna()` using the `ordering = <ordering>` parameter.
In this example, the intended ordering\
`ordering = list(c("A","R","F","L","C"),"W")`\
means that $W$ comes (causally) *after* all of the other factors, and hence it cannot be a cause of them (only an effect).

As is often the case with real-world data, this dataset does *not* comprise all relevant factors for $W$, hence it is not possible to reach perfect ($1$) coverage scores.
We thus need to lower the coverage threshold, and we will set it to $0.9$:

In [19]:
# We can use the function `csf()` to output only the complex solution formulas
sol1 <- cna(d.irrigate, ordering=list(c("A","R","F","L","C"),"W"), cov=0.9)
csf(sol1)

   outcome condition                                           consistency  coverage complexity inus
1  C,W     (a + L + R*f <-> C)*(A*R + a*l + R*F <-> W)                   1 0.9166667         10 TRUE
2  C,W     (a + L + R*f <-> C)*(A*R + R*F + l*C <-> W)                   1 0.9166667         10 TRUE
3  C,W     (a + L + R*f <-> C)*(A*L + R*F + l*C <-> W)                   1 0.9166667         10 TRUE
4  C,W     (a + L + R*f <-> C)*(a*l + A*C + R*F <-> W)                   1 0.9166667         10 TRUE
5  C,W     (a + L + R*f <-> C)*(A*C + R*F + l*C <-> W)                   1 0.9166667         10 TRUE
6  C,W     (a + L + R*f <-> C)*(A*L + a*l + R*F + R*l <-> W)             1 0.9166667         12 TRUE
7  C,W     (a + L + R*f <-> C)*(A*R + R*F + R*l + a*r*f <-> W)           1 0.9166667         13 TRUE
8  C,W     (a + L + R*f <-> C)*(A*R + R*F + R*l + r*f*C <-> W)           1 0.9166667         13 TRUE
9  C,W     (a + L + R*f <-> C)*(A*L + R*F + R*l + a*r*f <-> W)           1 0.9166667         13 TRUE
10 C,W     (a + L + R*f <-> C)*(A*L + R*F + R*l + r*f*C <-> W)           1 0.9166667         13 TRUE


Note that, in contrast with the authors' assumption, $W$ is not the only factor that can be modeled as endogenous: $C$ is modeled as such too.
The reason is that this dataset was originally analyzed with QCA, which focuses on single-outcome structures, whereas CNA returns all formulas that fare equally well on the parameters of model fit.
In this result, the behaviour of $C$ and $W$ is regulated by a common cause structure.

We may also generate models for negative outcomes with the `notcols` argument of `cna()`, which takes a character vector of the factors to be negated (these factors must also appear negated in the ordering).
Let's try, for example, negating $C$ and $W$ (and lowering the coverage threshold so that we obtain a result):

In [36]:
sol2 <- cna(d.irrigate, ordering=list(c("A","R","F","L","c"),"w"),
           notcols=c("C","W"),cov=0.66)

In [40]:
# nsolutions just defines how many formulas to visually show
print(sol2, nsolutions=3, what="a,c")

--- Coincidence Analysis (CNA) ---

Causal ordering:
A, R, F, L, C < W

Atomic solution formulas:
-------------------------
Outcome R:
 solution              consistency coverage complexity inus
 A*F + f*L <-> R                 1    0.667          4 TRUE
 A*C + f*L <-> R                 1    0.667          4 TRUE
 A*L + f*L + F*l <-> R           1    0.667          6 TRUE

Outcome w:
 solution        consistency coverage complexity inus
 A*r + r*F <-> w           1    0.667          4 TRUE
 A*r + r*L <-> w           1    0.667          4 TRUE
 r*F + r*c <-> w           1    0.667          4 TRUE
 ... (total no. of formulas: 6)

Complex solution formulas:
--------------------------
 outcome solution                            consistency coverage complexity
 R,w     (A*F + f*L <-> R)*(A*r + r*F <-> w)           1    0.667          8
 R,w     (A*F + f*L <-> R)*(A*r + r*L <-> w)           1    0.667          8
 R,w     (A*F + f*L <-> R)*(r*F + r*c <-> w)           1    0.667          8

The `condition()` function helps to inspect the properties of sufficient and necessary conditions in a data frame, notably those that appear in solution formulas returned by `cna()`.
It takes a vector of strings specifying Boolean functions and returns (i) the configurations and cases that instantiate a given condition or solution, and (ii) its consistency and coverage:

In [43]:
condition("A*r + F*r <-> w", d.irrigate)

A*r + F*r <-> w :
type of condition: atomic 
    A*r+F*r w | n.obs
1         0 0 |     1
2         0 1 |     1
3,4       0 0 |     2
5,6       0 0 |     2
7,8       0 0 |     2
9         0 0 |     1
10        0 0 |     1
11        1 1 |     1
12        1 1 |     1
13        0 0 |     1
14        0 0 |     1
15        0 0 |     1
Consistency: 1.000 (2/2)
Coverage:    0.667 (2/3)
Total no. of cases: 15


In this case, we see that the disjunction $A*r + F*r$ covers cases 11 and 12, leaving the occurrence of $w$ in case 2 uncovered.
Consequently, the overall solution coverage is 2/3.
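
The two scores can be recomputed from the printed table alone. The sketch below transcribes the two columns of the `condition()` output case by case (cases 1 to 15) and applies the crisp-set definitions of consistency and coverage; the vector names `cond` and `w` are chosen here for illustration.

```r
# Columns transcribed from the condition() output above, one entry per case.
cond <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0)  # A*r + F*r
w    <- c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0)  # w, i.e. negated W

both        <- sum(cond == 1 & w == 1)  # cases with condition and outcome
consistency <- both / sum(cond)         # 2 of 2 condition cases show w
coverage    <- both / sum(w)            # 2 of 3 w cases are covered

uncovered <- which(w == 1 & cond == 0)  # w occurs without the condition
print(c(consistency = consistency, coverage = coverage))
print(uncovered)
```

The only uncovered occurrence of $w$ is case 2, which is what keeps the coverage at 2/3.
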

We may also manually calculate *unique coverage scores* (Ragin, 2008) using the `summary()` function.
To isolate the cases covered by one disjunct alone, we conjoin that disjunct with the negation of the other; in this example, each of the two disjuncts uniquely covers one of the instances of $w$ (i.e., the negation of $W$):

In [54]:
summary(condition("A*r * -(F*r) <-> w", d.irrigate))

A*r*-(F*r) <-> w :
type of condition: atomic 
Consistency: 1.000 (1/1)
Coverage:    0.333 (1/3)
Total no. of cases: 15



In [55]:
summary(condition("F*r * -(A*r) <-> w", d.irrigate))

F*r*-(A*r) <-> w :
type of condition: atomic 
Consistency: 1.000 (1/1)
Coverage:    0.333 (1/3)
Total no. of cases: 15
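
The arithmetic behind these two results can be written out as follows. The notation is an assumption made here for readability: $|X|$ stands for the number of cases instantiating $X$, $\neg$ for negation, and $\mathrm{ucov}$ for unique coverage.

$$
\mathrm{ucov}(A*r) = \frac{|A*r * \neg(F*r) * w|}{|w|} = \frac{1}{3},
\qquad
\mathrm{ucov}(F*r) = \frac{|F*r * \neg(A*r) * w|}{|w|} = \frac{1}{3}
$$

The two unique coverages add up to $\frac{1}{3} + \frac{1}{3} = \frac{2}{3}$, which equals the coverage of the whole disjunction $A*r + F*r$ reported earlier. Hence no instance of $w$ is covered by both disjuncts at once.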

