<div align="center"><font size="5"><b>Logical Connectives<br>in the UK Statute Book</b></font></div><br>

<div align="center" style="background-color:#000000; color:white;"><font size="5"><b>Contents</b></font></div>

1. [Hypothesis](#1)
2. [An overview of data sources](#2)<br>
    2.1 [The legislation.gov.uk API](#2.1)<br>
3. [Data quality and pre-processing](#3)<br>
    3.1 [Wrangling data](#3.1)<br>
    3.2 [Logical operator definitions](#3.2)<br>
4. [Results from investigative analysis of data](#4)<br>
    4.1 [Creating a training set](#4.1)<br>
    4.2 [Creating a test set](#4.2)<br>
5. [Model development and testing](#5)<br>
    5.1 [Training the model and hyperparameter tuning](#5.1) <br>
    5.2 [Train and evaluate](#5.2)<br>
6. [Overview of testing results and final model selection](#6)<br>
    6.1 [Evaluation scores](#6.1)<br>
    6.2 [Saving the best model](#6.2)<br>
    6.3 [Results visualisation](#6.3)<br>
7. [Hypothesis test and conclusion](#7)<br>
    7.1 [Analytics and inferences](#7.1)<br>
    7.2 [Statistical differences](#7.2)<br>
    7.3 [Conclusion](#7.3)

<div align="center" style="background-color:#000000; color:white;"><font size="5"><b>Abstract</b></font></div>

<font size="3">This demonstration trains a custom Named-Entity Recognition (NER) model, based on a convolutional neural network, to label the logical connectives found in text from the UK statute book corpus. The model then makes inferences on three popular statutes that are obtained from the www.legislation.gov.uk API and pre-processed. The results are compared to establish whether a significant difference exists between them in terms of their logical connective complexity.</font>

<a id = "1"></a><br><div align="center" style="background-color:#000000; color:white;"><font size="5"><b>1. Hypothesis to test the degree of difference between logical connectives in statutes</b></font></div>

In [None]:
<font size="3">Logical connectives are truth functional, with symbolic notation supported by the context-free language $L_{1}$. The language can be used to teach an NER learning machine to classify logical connectives.</font>


| $L_{1}$-connective      | Word                | Term                         |          
| ----------------------- | ------------------- | ---------------------------- |
| P$\wedge$Q              | and                 | Conjunction                  |
| P$\vee$Q                | or                  | Disjunction                  |
| $\neg$P                 | not                 | Negation                     |
| P$\rightarrow$Q         | if...then           | Material implication         |
| P$\leftrightarrow$Q     | if and only if      | Bi-conditional               |


<font size="3">The constituent parts of a statement may be expressed formally with $L_{1}$ sentence letters and connectives. For example, the legislative phrase ‘[Lords Spiritual] <b>and</b> [Temporal]’ gives:</font><br><br>

$$(\mathbf P_0 \wedge \mathbf Q_0) \tag{1.1}$$
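
<font size="3">To make "truth functional" concrete, the connectives in the table above can be written as small Boolean functions. This is only an illustrative sketch: the names `CONNECTIVES`, `negation` and `truth_table` are assumptions introduced here, and the last two entries (joint denial and converse implication) are included because they appear as labels later in the notebook.</font>

```python
from itertools import product

def negation(p):
    """Negation is unary: it takes a single sentence letter."""
    return not p

# Binary connectives keyed by their symbolic form.
CONNECTIVES = {
    "P ∧ Q": lambda p, q: p and q,        # conjunction
    "P ∨ Q": lambda p, q: p or q,         # disjunction
    "P → Q": lambda p, q: (not p) or q,   # material implication
    "P ↔ Q": lambda p, q: p == q,         # bi-conditional
    "P ↓ Q": lambda p, q: not (p or q),   # joint denial
    "P ← Q": lambda p, q: p or (not q),   # converse implication
}

def truth_table(name):
    """Return [(p, q, value), ...] for one binary connective."""
    fn = CONNECTIVES[name]
    return [(p, q, fn(p, q)) for p, q in product([True, False], repeat=2)]

for name in CONNECTIVES:
    # Values are listed for (T,T), (T,F), (F,T), (F,F) in that order.
    print(name, [value for _, _, value in truth_table(name)])
```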

<font size="3">Let $c_{i,j}$ be the number of occurrences of logical connective type $i$ in $L_1$ within a statute $j$, and let $w_j$ be the number of words in that statute, such that the quotient $q_{i,j}$ is obtained:</font><br><br>

$$q_{i,j}=\frac{c_{i,j}}{w_j} \tag{1.2}$$

<font size="3">Thus, each $q_{i,j}$ can be tabulated as $\begin{Bmatrix} q_{i,j} \end{Bmatrix} _{n \times k}$, with one row per connective type $i$ and one column per statute $j$, to produce a dataset for deriving an overall $L_{1}$-connective index or Friedman $F_r$ statistic:</font><br><br>

$$Q =
\begin{pmatrix}
    q_{11}       & q_{12} & q_{13} & \dots & q_{1k} \\
    q_{21}       & q_{22} & q_{23} & \dots & q_{2k} \\
    \vdots       & \vdots & \vdots & \ddots & \vdots \\
    q_{n1}       & q_{n2} & q_{n3} & \dots & q_{nk}
\end{pmatrix} \tag{1.3}$$

<font size="3">Now, to calculate an $L_{1}$-connective index for a statute $j$, let all of the quotients in its column be summed:</font><br><br>

$$\sum_{i=1}^{n} q_{i,j}=q_{1,j} + q_{2,j} + \cdots + q_{n,j} \tag{1.4}$$
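
<font size="3">The quotient and the index can be illustrated with a few lines of plain Python. The counts, word totals and statute names below are made up purely for illustration (they are not results from this notebook), and `quotient_matrix` and `connective_index` are hypothetical helper names.</font>

```python
# Hypothetical counts: rows are connective types, columns are statutes.
CONNECTIVE_TYPES = ["(¬ P)", "(P ∨ Q)", "(P ∧ Q)"]
STATUTES = ["statute_A", "statute_B"]
counts = [
    [120, 90],
    [900, 600],
    [750, 450],
]
word_totals = [30000, 15000]  # one word count per statute (made-up values)

def quotient_matrix(counts, word_totals):
    """Return Q where Q[i][j] = counts[i][j] / word_totals[j]."""
    return [[c / w for c, w in zip(row, word_totals)] for row in counts]

def connective_index(Q, j):
    """Sum the quotients in column j, i.e. over every connective type."""
    return sum(row[j] for row in Q)

Q = quotient_matrix(counts, word_totals)
for j, name in enumerate(STATUTES):
    print(name, round(connective_index(Q, j), 5))
```

<font size="3">Dividing by the word count of each statute is what makes statutes of different lengths comparable.</font>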

<font size="3">Furthermore, we obtain $F_r$ by ranking the $k$ statute (treatment) columns within each of the $n$ connective (block) rows and summing the ranks in each column, with the sums named $T$:</font><br><br>

$$F_r = \frac{12}{nk(k+1)}(T_{1}^{2} + T_{2}^{2}+\cdots + T_{k}^{2})-3n(k+1) \tag{1.5}$$

<font size="3">The hypotheses then test whether there is a statistically significant difference between three statutes in a dataset $Q$ at a significance level $α$ of 0.05, using the probability value $(p)$ from a Friedman test to conclude either:</font><br><br>

<font size="3"><b>$H_{0}$: There is no statistically significant difference in a dataset of three groups of statute data in $Q$ such that $p$ > $α$</b></font><br><br>
<font size="3"><b>$H_{1}$: There is a statistically significant difference in a dataset of three groups of statute data in $Q$ such that $p$ ≤ $α$</b></font>
    
<font size="3">That is, the null hypothesis $H_{0}$ can be rejected if $p$ ≤ α because, if no difference in fact existed, there would be at most a 5% chance of observing a difference at least as large as the one found.</font>
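
<font size="3">A minimal sketch of the test, using only the standard library, is given here. It assumes rows are connective types and columns are statutes, omits any correction for tied ranks, and uses the closed-form chi-square tail $e^{-F_r/2}$, which is valid only for three groups (two degrees of freedom). The quotient values are invented for illustration, and the function names are hypothetical. SciPy's `scipy.stats.friedmanchisquare` is an alternative when SciPy is available.</font>

```python
import math

def average_ranks(row):
    """Rank the values in one row from 1..k, giving tied values their mean rank."""
    order = sorted(range(len(row)), key=lambda idx: row[idx])
    ranks = [0.0] * len(row)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value equals the current one.
        while end + 1 < len(order) and row[order[end + 1]] == row[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # ranks are 1-based
        for idx in order[pos:end + 1]:
            ranks[idx] = mean_rank
        pos = end + 1
    return ranks

def friedman_statistic(Q):
    """Compute F_r from an n x k matrix using the rank-sum formula (no tie correction)."""
    n, k = len(Q), len(Q[0])
    rank_sums = [0.0] * k
    for row in Q:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    return 12.0 / (n * k * (k + 1)) * sum(t * t for t in rank_sums) - 3.0 * n * (k + 1)

def p_value_three_groups(f_r):
    """Upper tail of the chi-square distribution with 2 degrees of freedom (k = 3 only)."""
    return math.exp(-f_r / 2.0)

# Invented quotients: six connective types (rows) by three statutes (columns).
Q_example = [
    [0.0040, 0.0050, 0.0060],
    [0.0300, 0.0280, 0.0350],
    [0.0250, 0.0270, 0.0310],
    [0.0001, 0.0002, 0.0001],
    [0.0003, 0.0002, 0.0004],
    [0.0060, 0.0070, 0.0090],
]

alpha = 0.05
f_r = friedman_statistic(Q_example)
p = p_value_three_groups(f_r)
print("F_r =", round(f_r, 3), "p =", round(p, 4))
print("Reject H0" if p <= alpha else "Fail to reject H0")
```

<font size="3">With only six rows the chi-square approximation is rough, so the resulting $p$ should be read as indicative.</font>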

<a id = "2"></a><br>
<div align="center" style="background-color:#000000; color:white;"><font size="5"><b>2. The Data Source</b></font></div>

In [None]:
!pip install spacy==2.3.5
!pip install spacy-lookups-data
!python -m spacy download en_core_web_sm
!python -m spacy download en

<a id = "2.1"></a><br><div align="left"><font size="5"><b>2.1 The legislation.gov.uk API</b></font></div><br>
<font size="3">The model is trained with the Law of Property Act 1925, as it offers a wide range of logical connective terms.</font><br>
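
<font size="3">The statute URLs requested in this notebook all share one shape, which the following sketch captures. The pattern is inferred only from the URLs used in this notebook rather than from the API documentation, and `enacted_data_url` is a hypothetical helper name.</font>

```python
BASE_URL = "https://www.legislation.gov.uk"

def enacted_data_url(year_path, chapter, series="ukpga"):
    """Build the 'as enacted' HTML data URL for one Act.

    year_path is a calendar year such as "1996", or a regnal-year path
    such as "Geo5/15-16" for older Acts (assumption based on the URLs
    that appear in this notebook).
    """
    return f"{BASE_URL}/{series}/{year_path}/{chapter}/enacted/data.html"

print(enacted_data_url("Geo5/15-16", 20))  # Law of Property Act 1925
print(enacted_data_url("1996", 18))        # Employment Rights Act 1996
```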

In [None]:
import requests

data = requests.get("https://www.legislation.gov.uk/ukpga/Geo5/15-16/20/enacted/data.html") #Law of Property Act (1925)
content = data.content
print(content) # HTML tags present

<a id = "3"></a><br>
<div align="center" style="background-color:#000000; color:white;"><font size="5"><b>3. Data quality and pre-processing</b></font></div>

<div align="left"><font size="5"><b>3.1 Wrangling data</b></font></div><a id = "3.1"></a>

In [None]:
!pip install bs4

In [None]:
import re
from bs4 import BeautifulSoup
# simplified version
#soup = BeautifulSoup(content)
#clean_content = soup.get_text()
#print(clean_content[500:5000])


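# verbose version
def strip_html_tags(text):
    """Return the visible text of an HTML document with blank lines collapsed."""
    soup = BeautifulSoup(text, "html.parser")
    # drop embedded frames and scripts before extracting the text
    for tag in soup(['iframe', 'script']):
        tag.extract()
    stripped_text = soup.get_text()
    # collapse runs of carriage returns and newlines into a single newline
    stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)
    return stripped_text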

clean_content = strip_html_tags(content)
print(clean_content[500:5000])

<a id = "3.2"></a><br>

<div align="left"><font size="5"><b>3.2 Logical operator definitions</b></font></div>

In [None]:
#Import the requisite library
import spacy

#Start from a blank English pipeline
nlp = spacy.blank("en")


#Sample text
text = clean_content

#Create the EntityRuler
ruler = nlp.create_pipe("entity_ruler")

#List of Entities and Patterns
ruler.add_patterns([{"label": "(P ∧ Q)", "pattern": "and"},{"label": "(P ∧ Q)", "pattern": "but"},{"label": "(¬ P)", "pattern": "not"},#disregarding material nonimplication"↛"
                    {"label": "(P ∨ Q)", "pattern": "or"},{"label": "(P ← Q)", "pattern": "if"},{"label": "(P ↓ Q)", "pattern": "nor"},# "nor" is normally preceded by "neither"
                    {"label": "(P → Q)", "pattern": "so"},{"label": "(P → Q)", "pattern": "therefore"}])
nlp.add_pipe(ruler)

doc = nlp(text)

#extract entities
for ent in doc.ents[107:180]:
   print (ent.text, ent.label_)


<a id = "4"></a><br>

<div align="center" style="background-color:#000000; color:white;"><font size="5"><b>4. Results from investigative analysis of data</b></font></div>

<div align="left"><font size="5"><b>4.1 Creating a training set</b></font></div><a id = "4.1"></a>
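
<font size="3">Each training example takes the form `[sentence, {"entities": [[start, end, label], ...]}]`, where the offsets are character positions. The sketch below reproduces that shape with a regular expression instead of spaCy's `EntityRuler`, so it is a simplified stand-in: tokenisation differs, and `label_sentence` is a hypothetical name. The example sentence is taken from the Parliament (Qualification of Women) Act 1918 sample that appears later in this notebook.</font>

```python
import re

# Same word-to-label mapping as the EntityRuler patterns in this notebook.
LABELS = {
    "and": "(P ∧ Q)", "but": "(P ∧ Q)", "not": "(¬ P)", "or": "(P ∨ Q)",
    "if": "(P ← Q)", "nor": "(P ↓ Q)", "so": "(P → Q)", "therefore": "(P → Q)",
}

# Whole-word, case-sensitive match on any of the connective words.
CONNECTIVE_RE = re.compile(r"\b(" + "|".join(LABELS) + r")\b")

def label_sentence(sentence):
    """Return [sentence, {"entities": [[start, end, label], ...]}]."""
    entities = [
        [match.start(), match.end(), LABELS[match.group(1)]]
        for match in CONNECTIVE_RE.finditer(sentence)
    ]
    return [sentence, {"entities": entities}]

example = ("A woman shall not be disqualified by sex or marriage for being "
           "elected to or sitting or voting as a Member of the Commons House "
           "of Parliament.")
print(label_sentence(example)[1])
```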

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
text = clean_content
corpus = []

doc = nlp(text)
for sent in doc.sents:
    corpus.append(sent.text)

nlp = spacy.blank("en")

ruler = nlp.create_pipe("entity_ruler")

patterns = [{"label": "(P ∧ Q)", "pattern": "and"},{"label": "(P ∧ Q)", "pattern": "but"},{"label": "(¬ P)", "pattern": "not"},#disregarding material nonimplication"↛"
                    {"label": "(P ∨ Q)", "pattern": "or"},{"label": "(P ← Q)", "pattern": "if"},{"label": "(P ↓ Q)", "pattern": "nor"},# "nor" is normally preceded by "neither"
                    {"label": "(P → Q)", "pattern": "so"},{"label": "(P → Q)", "pattern": "therefore"}]

ruler.add_patterns(patterns)
nlp.add_pipe(ruler)

TRAIN_DATA = []
for sentence in corpus:
    doc = nlp(sentence)
    entities = []

    for ent in doc.ents:
        entities.append([ent.start_char, ent.end_char, ent.label_])
    TRAIN_DATA.append([sentence, {"entities": entities}])

print (TRAIN_DATA[618:640]) #view sample labelled training data LPA (1925)

<a id = "4.2"></a><br>

<div align="left"><font size="5"><b>4.2 Creating a test set</b></font></div><br>
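<font size="3">TRAIN_DATA is the labelled LPA 1925 data. In this demonstration TEST_DATA is set to the same labelled LPA 1925 data, so the evaluation scores measure how well the model fits its training data rather than how it performs on unseen text.</font><br>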

In [None]:
import spacy
import random
import time
import warnings
from spacy.util import minibatch, compounding, decaying
from spacy.gold import GoldParse
from spacy.scorer import Scorer

# Test data is validation data

TEST_DATA = TRAIN_DATA
#TEST_DATA = [['Parliament (Qualification of Women) Act 1918http://www.legislation.gov.uk/ukpga/Geo5/8-9/47/enactedParliament (Qualification of Women)', {'entities': []}], ["Act 1918Members Of ParliamentQueen's Printer of Acts of Parliament2021-02-05Parliament", {'entities': []}], ['(Qualification of Women)', {'entities': []}], ['Act', {'entities': []}], ['19181918 Chapter 47An Act to amend the Law with respect to the Capacity of Women to sit in Parliament.[21st November 1918]Be', {'entities': []}], ["it enacted by the King's most Excellent Majesty, by and with the advice and consent of the Lords Spiritual and Temporal, and Commons, in this present Parliament assembled, and by the authority of the same, as follows:1Capacity of women to be members of Parliament.", {'entities': [[52, 55, '(P ∧ Q)'], [72, 75, '(P ∧ Q)'], [107, 110, '(P ∧ Q)'], [121, 124, '(P ∧ Q)'], [172, 175, '(P ∧ Q)']]}], ['A woman shall not be disqualified by sex or marriage for being elected to or sitting or voting as a Member of the Commons House of Parliament.2Short title.', {'entities': [[14, 17, '(¬ P)'], [41, 43, '(P ∨ Q)'], [74, 76, '(P ∨ Q)'], [85, 87, '(P ∨ Q)']]}], ['This Act may be cited as the Parliament (Qualification of Women) Act, 1918.', {'entities': []}]]

random.seed(0)

# Log files for recording the training losses and the train/test scores
# Reset the loss log that train_spacy() appends to and section 6.1 plots;
# no header row is written because np.genfromtxt is given explicit column names
open('outputlog.txt', 'w').close()

score_header = "iteration_no,ents_p,ents_r,ents_f,ents_per_type\n"

# Write the headers and close the files so train_spacy() can append to them
with open('test_output.txt', 'w') as file1:
    file1.write(score_header)

with open('train_output.txt', 'w') as file2:
    file2.write(score_header)

model = None # ("en_core_web_sm")   # Replace with model to train
start_training_time = time.time()

print("Test set created")

<a id = "5"></a><br>
<div align="center" style="background-color:#000000; color:white;"><font size="5"><b>5. Model development and testing</b></font></div>

<font size="3">The training process follows the training pipeline shown in the diagram below, taken from spacy.io:</font><br><br>

<p><center><img style="float: top;margin: max-width:1000px" src="https://spacy.io/training-73950e71e6b59678754a87d6cf1481f9.svg"></center></p>

<div align="left"><font size="5"><b>5.1 Training the model and hyperparameter tuning</b></font></div><a id = "5.1"></a><br>
<font size="3">The following hyperparameters are tuned:</font><br><br>

| Hyperparameter          | Function                                           | Tuned                      |
| ----------------------- | -------------------------------------------------- | -------------------------- |
| dropout_from            | Initial dropout rate                               | 0.8                        |
| dropout_decay           | Rate of dropout change (0 = unlimited)             | 1e-6                       |
| token_vector_width      | Width of embedding tables and convolutional layers | 256                        |
| conv_depth              | Depth of the tok2vec layer                         | 16                         |
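
<font size="3">To show how the dropout and batch-size schedules behave, here are pure-Python stand-ins for the `decaying` and `compounding` helpers imported from `spacy.util`. They reflect an understanding of the spaCy v2 behaviour (a linear decrease floored at the stop value, and a geometric increase capped at the stop value) and are not copied from the library. The `compounding` stand-in only handles the increasing case used in this notebook.</font>

```python
from itertools import islice

def decaying(start, stop, decay):
    """Yield values that fall linearly by `decay` per step, never below `stop`."""
    curr = float(start)
    while True:
        yield max(curr, stop)
        curr -= decay

def compounding(start, stop, compound):
    """Yield values that grow by the factor `compound` per step, capped at `stop`."""
    curr = float(start)
    while True:
        yield min(curr, stop)
        curr *= compound

dropout = decaying(0.8, 0.2, 1e-6)    # same arguments as the training cell
sizes = compounding(1.0, 4.0, 1.001)  # same arguments as the training cell

updates = 10_000
# Take the value each schedule would supply on the 10,000th update.
dropout_then = next(islice(dropout, updates - 1, None))
size_then = next(islice(sizes, updates - 1, None))
print("dropout after", updates, "updates:", round(dropout_then, 4))
print("batch size after", updates, "updates:", round(size_then, 4))
```

<font size="3">Under these assumptions, a decay of 1e-6 per update lowers the dropout by only about 0.01 over 10,000 updates, so in a short run the rate stays close to its initial 0.8.</font>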

In [None]:
def train_spacy(TRAIN_DATA, iterations):

    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")

   # TRAIN_DATA = data

    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    
    else:
        ner = nlp.get_pipe("ner")

    # add labels
    for _, annotations in TRAIN_DATA:
         for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    if model is None:
        optimizer = nlp.begin_training()

        # For training with customized cfg 
        nlp.entity.cfg['conv_depth'] = 16
        nlp.entity.cfg['token_vector_width'] = 256
        # nlp.entity.cfg['bilstm_depth'] = 1
        # nlp.entity.cfg['beam_width'] = 2


    else:
        print ("resuming")
        optimizer = nlp.resume_training()
        print (optimizer.learn_rate)
    
    # get names of other pipes to disable them during training
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    
    dropout = decaying(0.8, 0.2, 1e-6)  # start, stop, decay per update
    sizes = compounding(1.0, 4.0, 1.001)  # start, stop, compounding factor

    with nlp.disable_pipes(*other_pipes):  # only train NER
        
        warnings.filterwarnings("once", category=UserWarning, module='spacy')

        for itn in range(iterations):
            
            file = open('outputlog.txt','a') # For logging losses of iterations 
            
            start = time.time() # Iteration Time
            
            if(itn%100 == 0):
                print("Itn  : "+str(itn), time.time()-start_training_time)
                print('Testing')
               
                results = evaluate(nlp, TEST_DATA)
                file1 = open('test_output.txt','a') 
                file1.write(str(itn)+','+ str(results['ents_p'])+','+str(results['ents_r'])+','+str(results['ents_f'])+','+str(results["ents_per_type"])+"\n")
                file1.close()

                results = evaluate(nlp, TRAIN_DATA)
                file2 = open('train_output.txt','a') 
                file2.write(str(itn)+','+ str(results['ents_p'])+','+str(results['ents_r'])+','+str(results['ents_f'])+','+str(results["ents_per_type"])+"\n")
                file2.close()

                modelfile = "training_model"+str(itn)
                nlp.to_disk(modelfile)
  
            # Reduce the learning rate once iteration 500 is reached
            if (itn == 500):
                optimizer.learn_rate = 0.0001 
    
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}

            # use either batches or entire set at once

            ##### For training in Batches
            batches = minibatch(TRAIN_DATA, size=sizes)
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=next(dropout), losses=losses)

            ###########################################

            ##### For training in as a single iteration
            
            #for text, annotations in TRAIN_DATA:
                 #nlp.update(
                         #[text],  # batch of texts
                         #[annotations],  # batch of annotations
                         #drop=0.2,  # dropout - make it harder to memorise data
                         ## drop=next(dropout),  Incase you are using decaying drop
                         #sgd=optimizer,  # callable to update weights
                         #losses=losses)


            print("Losses",losses)
            file.write(str(itn) + "," + str(losses['ner']) +"\n")
            print ("time for iteration:", time.time()-start)
            file.close()

    return nlp

print("Training code and hyperparameters set")

<div align="left"><font size="5"><b>5.2 Train and evaluate</b></font></div><a id = "5.2"></a><br>
<font size="4"> The number of epochs is set to 10 for this demonstration.</font><br>
<font size="4"> The video skips the start and resumes at iteration 9 to save time.</font>

In [None]:
def evaluate(ner_model, test_data):
    scorer = Scorer()
    for input_, annot in test_data:
        doc_gold_text = ner_model.make_doc(input_)
        gold = GoldParse(doc_gold_text, entities=annot['entities'])
        pred_value = ner_model(input_)
        scorer.score(pred_value, gold)
    return scorer.scores
print("Evaluation function created")

In [None]:
prdnlp = train_spacy(TRAIN_DATA, 10) # 10 epochs for this demonstration

<a id = "6"></a><br>

<div align="center" style="background-color:#000000; color:white;"><font size="5"><b>6. Overview of testing results and final model selection</b></font></div>

<div align="left"><font size="5"><b>6.1 Evaluation scores</b></font></div><a id = "6.1"></a>
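
<font size="3">The entity scores reported in this section (`ents_p`, `ents_r`, `ents_f`) are precision, recall and F-score over labelled spans. As a rough illustration of how such scores arise, the sketch below compares a set of gold spans with a set of predicted spans, counting a prediction as correct only when start, end and label all match. It returns fractions between 0 and 1, and `span_prf` is a hypothetical helper, not spaCy's `Scorer`.</font>

```python
def span_prf(gold_spans, predicted_spans):
    """Return (precision, recall, f_score) for two collections of (start, end, label)."""
    gold = set(gold_spans)
    predicted = set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Invented example: one wrong label and one spurious span among the predictions.
gold = [(14, 17, "(¬ P)"), (41, 43, "(P ∨ Q)"), (74, 76, "(P ∨ Q)")]
predicted = [(14, 17, "(¬ P)"), (41, 43, "(P ∧ Q)"), (74, 76, "(P ∨ Q)"), (85, 87, "(P ∨ Q)")]

p, r, f = span_prf(gold, predicted)
print("precision", round(p, 3), "recall", round(r, 3), "f", round(f, 3))
```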

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.pyplot import figure
matplotlib.style.use('ggplot')
figure(num=None, figsize=(9, 7), dpi=80, facecolor='w', edgecolor='k')

data = np.genfromtxt('./outputlog.txt', delimiter=",", names=["x", "y"])
plt.title('Training performance')
plt.xlabel("Iteration")
plt.ylabel("Loss")

plt.plot(data['x'], data['y'], 'r')
plt.xticks(range(0,20))
#plt.savefig('line_plot.pdf') 
# https://stackoverflow.com/questions/52229875/how-to-force-matplotlib-to-show-values-on-x-axis-as-integers


In [None]:
# test_text = input("Enter your testing text: ")
# doc = prdnlp(test_text)
# for ent in doc.ents:
#     print(ent.text, ent.start_char, ent.end_char, ent.label_)

# Prints Final -- f1 score, precision and recall
results = evaluate(prdnlp, TEST_DATA)
import json
print (json.dumps(results,indent=4))

<div align="left"><font size="5"><b>6.2 Saving the best model and testing a statute</b></font></div><a id = "6.2"></a>

In [None]:
!pip install bs4

<font size="3">Using the Employment Rights Act (ERA) 1996 as test data</font>

In [None]:
import requests
from bs4 import BeautifulSoup

statute = requests.get("https://www.legislation.gov.uk/ukpga/1996/18/enacted/data.html") #Employment Rights Act 1996
content = statute.content

# name the parser explicitly, as in strip_html_tags above
soup = BeautifulSoup(content, "html.parser")
clean_statute = soup.get_text()
print(clean_statute[50000:52500])

<font size="3">Make inferences on the ERA 1996</font>

In [None]:
#inference on unseen data (a statute not used in training)

text = clean_statute
doc_ERA_data = prdnlp(text)

for ent in doc_ERA_data.ents:
    print (ent.text, ent.label_)

In [None]:
#Count the entities 

# Check other text that might have been incorrectly counted

from collections import Counter 
  
# Create a list 
z = [(ent.text, ent.label_) for ent in doc_ERA_data.ents]
col_count = Counter(z) 
print(col_count) 


In [None]:
# Save trained Model

# uncomment if model name through command line
# modelfile = input("Enter your Model Name: ")
modelfile = "Final_model"
prdnlp.to_disk(modelfile)
print("Saved model to", modelfile)

<div align="left"><font size="5"><b>6.3 Results visualisation</b></font></div><a id = "6.3"></a>

In [None]:
from spacy import displacy
colors = {"(P ∧ Q)":"lightblue", "(¬ P)":"lightgreen","(P ∨ Q)":"orange","(P ← Q)":"purple","(P ↓ Q)":"lightcoral","(P → Q)":"grey"}
options = {"ents": ["(P ∧ Q)", "(¬ P)", "(P ∨ Q)", "(P ← Q)", "(P ↓ Q)", "(P → Q)"], "colors": colors}
displacy.render(doc_ERA_data[15000:20000], style='ent', jupyter=True, options=options)

In [None]:
!pip install py-readability-metrics
!python -m nltk.downloader punkt

<font size="3">Show inference figures</font>

In [None]:
ERA_NEG_count = 0

# Iterate over all the entities
for ent in doc_ERA_data.ents:
    if ("(¬ P)" in ent.label_):  # issues counting (¬ P) when 1
        # Increment count
        ERA_NEG_count += 1
    
ERA_DSJ_count = 0

# Iterate over all the entities
for ent in doc_ERA_data.ents:
    if ("(P ∨ Q)" in ent.label_):  
        # Increment count
        ERA_DSJ_count += 1

ERA_CNJ_count = 0

# Iterate over all the entities
for ent in doc_ERA_data.ents:
    if ("(P ∧ Q)" in ent.label_): 
        # Increment count
        ERA_CNJ_count += 1
        
ERA_MIMP_count = 0

# Iterate over all the entities
for ent in doc_ERA_data.ents:
    if ("(P → Q)" in ent.label_): 
        # Increment count
        ERA_MIMP_count += 1

ERA_JD_count = 0

# Iterate over all the entities
for ent in doc_ERA_data.ents:
    if ("(P ↓ Q)" in ent.label_):  
        # Increment count
        ERA_JD_count += 1

ERA_CIMP_count = 0

# Iterate over all the entities
for ent in doc_ERA_data.ents:
    if ("(P ← Q)" in ent.label_):  
        # Increment count
        ERA_CIMP_count += 1
        
# Print count
print("(¬ P) =", ERA_NEG_count)
print("(P ∨ Q) =", ERA_DSJ_count)
print("(P ∧ Q) =", ERA_CNJ_count)
print("(P → Q) =", ERA_MIMP_count)
print("(P ↓ Q) =", ERA_JD_count)
print("(P ← Q) =", ERA_CIMP_count)

# using split() 
# to count words in string 
res = len(text.split())
total_w = res
# printing result 
print ("Total number of words = " +  str(total_w)) 

print ("Total number of logical connectives = " + str(ERA_NEG_count + ERA_DSJ_count + ERA_CNJ_count + ERA_MIMP_count + ERA_JD_count + ERA_CIMP_count))

total_lc = ERA_NEG_count + ERA_DSJ_count + ERA_CNJ_count + ERA_MIMP_count + ERA_JD_count + ERA_CIMP_count
ratio = (total_lc / total_w)
print ("Logical connective count / number of words = " + str(ratio))





from readability import Readability
r = Readability(clean_statute)

f = r.flesch()

print("Flesch score = " + str(f.score) + " " + str(f.ease))

#ERA_lc_quotient = (total_lc / total_w * 100)

#ERA_lc_quotient_rd = str(round(ERA_lc_quotient, 2))

#print (ERA_lc_quotient_rd)
#print (ERA_lc_quotient)

#print ("Logical connective quotient = " + str(round(ERA_lc_quotient, 2)))

###########################

ERA_NEG_quotient = (ERA_NEG_count / total_w)
ERA_DSJ_quotient = (ERA_DSJ_count / total_w)
ERA_CNJ_quotient = (ERA_CNJ_count / total_w)
ERA_MIMP_quotient = (ERA_MIMP_count / total_w)
ERA_JD_quotient = (ERA_JD_count / total_w)
ERA_CIMP_quotient = (ERA_CIMP_count / total_w)

ERA_total_quotient = ERA_NEG_quotient + ERA_DSJ_quotient + ERA_CNJ_quotient + ERA_MIMP_quotient + ERA_JD_quotient + ERA_CIMP_quotient


print ("Negation (¬ P) quotient = " + str(round(ERA_NEG_quotient, 5)))
print ("Disjunction (P ∨ Q) quotient = " + str(round(ERA_DSJ_quotient, 5)))
print ("Conjunction (P ∧ Q) quotient = " + str(round(ERA_CNJ_quotient, 5)))
print ("Material Implication (P → Q) quotient = " + str(round(ERA_MIMP_quotient, 5)))
print ("Joint Denial (P ↓ Q) quotient = " + str(round(ERA_JD_quotient, 5)))
print ("Converse Implication (P ← Q) quotient = " + str(round(ERA_CIMP_quotient, 5)))
print ("Total logical connective quotient = " + str(round(ERA_total_quotient, 2)))

In [None]:
# a bar plot of connective counts in the ERA 1996
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')

data = [("(¬ P)", ERA_NEG_count), ("(P ∨ Q)", ERA_DSJ_count), ("(P ∧ Q)", ERA_CNJ_count), ("(P → Q)", ERA_MIMP_count), ("(P ↓ Q)", ERA_JD_count), ("(P ← Q)", ERA_CIMP_count)]
names, values = zip(*data)  
# names = [x[0] for x in data]  # These two lines are equivalent to the zip-command.
# values = [x[1] for x in data] # These two lines are equivalent to the zip-command.

ind = np.arange(len(data))  # the x locations for the groups
width = 0.35       # the width of the bars

fig, ax = plt.subplots(figsize=(8,10))
rects1 = ax.bar(ind, values, width, color='navy', label="Employment Rights Act 1996")

# add some text for labels, title and axes ticks
ax.set_ylabel('Count')
ax.set_xlabel('Logical connective')
ax.set_title('Occurrences by logical connective in ERA 1996')
ax.legend()
ax.set_xticks(ind)
ax.set_xticklabels(names)



def autolabel(rects1):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects1:
        height = rect.get_height()
        ax.annotate('{}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 0),  # no offset from the top of the bar
                    textcoords="offset points",
                    ha='center', va='bottom')


autolabel(rects1)

plt.savefig("era.pdf")

plt.show()

<a id = "7"></a><br>
<div align="center" style="background-color:#000000; color:white;"><font size="5"><b>7. Hypothesis test and conclusion</b></font></div>

<font size="3">Import the Freedom of Information Act 2000 (FIA), the Data Protection Act 1998 (DPA), and the Health and Safety at Work etc. Act 1974 (HSWA).</font>

In [None]:
import requests
from bs4 import BeautifulSoup

# each statute is fetched as enacted and reduced to plain text;
# the parser is named explicitly, as in strip_html_tags above
statute = requests.get("https://www.legislation.gov.uk/ukpga/2000/36/enacted/data.html") #Freedom of Information Act 2000
FIA = statute.content
soup1 = BeautifulSoup(FIA, "html.parser")
clean_FIA = soup1.get_text()

statute = requests.get("https://www.legislation.gov.uk/ukpga/1998/29/enacted/data.html") #Data Protection Act 1998
DPA = statute.content
soup2 = BeautifulSoup(DPA, "html.parser")
clean_DPA = soup2.get_text()

statute = requests.get("https://www.legislation.gov.uk/ukpga/1974/37/enacted/data.html") #Health and Safety at Work etc. Act 1974
HSWA = statute.content
soup3 = BeautifulSoup(HSWA, "html.parser")
clean_HSWA = soup3.get_text()

print("Statutes imported and cleaned")

<div align="left"><font size="5"><b>7.1 Analytics and inferences</b></font></div><a id = "7.1"></a><br>
<font size="3">Collecting data from inferences on the statutes for statistical testing.</font>
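
<font size="3">The cells in this section repeat the same six counting loops for each statute. As a possible simplification, the counts and quotients can be derived in one pass with `collections.Counter`. The helper name `connective_quotients` is hypothetical, and the labels and text in the example are invented stand-ins for `[ent.label_ for ent in doc.ents]` and a cleaned statute.</font>

```python
from collections import Counter

LABEL_ORDER = ["(¬ P)", "(P ∨ Q)", "(P ∧ Q)", "(P → Q)", "(P ↓ Q)", "(P ← Q)"]

def connective_quotients(labels, text):
    """Return {label: count / word_count} for every label in LABEL_ORDER.

    labels is an iterable of entity labels; text is the statute text,
    whose word count is taken by whitespace splitting as elsewhere in
    this notebook.
    """
    counts = Counter(labels)
    total_words = len(text.split())
    if total_words == 0:
        return {label: 0.0 for label in LABEL_ORDER}
    return {label: counts[label] / total_words for label in LABEL_ORDER}

# Invented stand-ins for one statute.
example_labels = ["(¬ P)", "(P ∨ Q)", "(P ∨ Q)"]
example_text = "a person shall not do this or that or the other"

quotients = connective_quotients(example_labels, example_text)
print(quotients)
print("total quotient:", sum(quotients.values()))
```

<font size="3">Calling the helper once per statute would yield the three columns of $Q$ without the duplicated loops.</font>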

In [None]:
FIA_data = clean_FIA
doc_FIA_data = prdnlp(FIA_data)

DPA_data = clean_DPA
doc_DPA_data = prdnlp(DPA_data)

HSWA_data = clean_HSWA
doc_HSWA_data = prdnlp(HSWA_data)

print("Inferences complete")

In [None]:
FIA_NEG_count = 0

# Iterate over all the entities
for ent in doc_FIA_data.ents:
    if ("(¬ P)" in ent.label_):  # issues counting (¬ P) when 1
        # Increment count
        FIA_NEG_count += 1
    
FIA_DSJ_count = 0

# Iterate over all the entities
for ent in doc_FIA_data.ents:
    if ("(P ∨ Q)" in ent.label_):  
        # Increment count
        FIA_DSJ_count += 1

FIA_CNJ_count = 0

# Iterate over all the entities
for ent in doc_FIA_data.ents:
    if ("(P ∧ Q)" in ent.label_): 
        # Increment count
        FIA_CNJ_count += 1
        
FIA_MIMP_count = 0

# Iterate over all the entities
for ent in doc_FIA_data.ents:
    if ("(P → Q)" in ent.label_): 
        # Increment count
        FIA_MIMP_count += 1

FIA_JD_count = 0

# Iterate over all the entities
for ent in doc_FIA_data.ents:
    if ("(P ↓ Q)" in ent.label_):  
        # Increment count
        FIA_JD_count += 1

FIA_CIMP_count = 0

# Iterate over all the entities
for ent in doc_FIA_data.ents:
    if ("(P ← Q)" in ent.label_):  
        # Increment count
        FIA_CIMP_count += 1
        

# using split() 
# to count words in string 
FIA_res = len(FIA_data.split()) 
FIA_total_w = FIA_res

#total_lc = FIA_NEG_count + FIA_DSJ_count + FIA_CNJ_count + FIA_MIMP_count + FIA_JD_count + FIA_CIMP_count
#FIA_lc_quotient = (total_lc / total_w * 100)

FIA_NEG_quotient = (FIA_NEG_count / FIA_total_w)
FIA_DSJ_quotient = (FIA_DSJ_count / FIA_total_w)
FIA_CNJ_quotient = (FIA_CNJ_count / FIA_total_w)
FIA_MIMP_quotient = (FIA_MIMP_count / FIA_total_w)
FIA_JD_quotient = (FIA_JD_count / FIA_total_w)
FIA_CIMP_quotient = (FIA_CIMP_count / FIA_total_w)

FIA_total_quotient = FIA_NEG_quotient + FIA_DSJ_quotient + FIA_CNJ_quotient + FIA_MIMP_quotient + FIA_JD_quotient + FIA_CIMP_quotient
print("Stats complete")

In [None]:
FIA_total_quotient = FIA_NEG_quotient + FIA_DSJ_quotient  + FIA_CNJ_quotient  + FIA_MIMP_quotient  + FIA_JD_quotient  + FIA_CIMP_quotient 
print (FIA_total_quotient)

In [None]:
DPA_NEG_count = 0

# Iterate over all the entities
for ent in doc_DPA_data.ents:
    if ("(¬ P)" in ent.label_):  # issues counting (¬ P) when 1
        # Increment count
        DPA_NEG_count += 1
    
DPA_DSJ_count = 0

# Iterate over all the entities
for ent in doc_DPA_data.ents:
    if ("(P ∨ Q)" in ent.label_):  
        # Increment count
        DPA_DSJ_count += 1

DPA_CNJ_count = 0

# Iterate over all the entities
for ent in doc_DPA_data.ents:
    if ("(P ∧ Q)" in ent.label_): 
        # Increment count
        DPA_CNJ_count += 1
        
DPA_MIMP_count = 0

# Iterate over all the entities
for ent in doc_DPA_data.ents:
    if ("(P → Q)" in ent.label_): 
        # Increment count
        DPA_MIMP_count += 1

DPA_JD_count = 0

# Iterate over all the entities
for ent in doc_DPA_data.ents:
    if ("(P ↓ Q)" in ent.label_):  
        # Increment count
        DPA_JD_count += 1

DPA_CIMP_count = 0

# Iterate over all the entities
for ent in doc_DPA_data.ents:
    if ("(P ← Q)" in ent.label_):  
        # Increment count
        DPA_CIMP_count += 1
        

# using split() 
# to count words in string 
DPA_res = len(DPA_data.split()) 
DPA_total_w = DPA_res

DPA_NEG_quotient = (DPA_NEG_count / DPA_total_w)
DPA_DSJ_quotient = (DPA_DSJ_count / DPA_total_w)
DPA_CNJ_quotient = (DPA_CNJ_count / DPA_total_w)
DPA_MIMP_quotient = (DPA_MIMP_count / DPA_total_w)
DPA_JD_quotient = (DPA_JD_count / DPA_total_w)
DPA_CIMP_quotient = (DPA_CIMP_count / DPA_total_w)

DPA_total_quotient = DPA_NEG_quotient + DPA_DSJ_quotient + DPA_CNJ_quotient + DPA_MIMP_quotient + DPA_JD_quotient + DPA_CIMP_quotient
print("Stats complete")

In [None]:
HSWA_NEG_count = 0

# Iterate over all the entities
for ent in doc_HSWA_data.ents:
    if ("(¬ P)" in ent.label_):  # issues counting (¬ P) when 1
        # Increment count
        HSWA_NEG_count += 1
    
HSWA_DSJ_count = 0

# Iterate over all the entities
for ent in doc_HSWA_data.ents:
    if ("(P ∨ Q)" in ent.label_):  
        # Increment count
        HSWA_DSJ_count += 1

HSWA_CNJ_count = 0

# Iterate over all the entities
for ent in doc_HSWA_data.ents:
    if ("(P ∧ Q)" in ent.label_): 
        # Increment count
        HSWA_CNJ_count += 1
        
HSWA_MIMP_count = 0

# Iterate over all the entities
for ent in doc_HSWA_data.ents:
    if ("(P → Q)" in ent.label_): 
        # Increment count
        HSWA_MIMP_count += 1

HSWA_JD_count = 0

# Iterate over all the entities
for ent in doc_HSWA_data.ents:
    if ("(P ↓ Q)" in ent.label_):  
        # Increment count
        HSWA_JD_count += 1

HSWA_CIMP_count = 0

# Iterate over all the entities
for ent in doc_HSWA_data.ents:
    if ("(P ← Q)" in ent.label_):  
        # Increment count
        HSWA_CIMP_count += 1
        

# using split() 
# to count words in string 
HSWA_res = len(HSWA_data.split()) 
HSWA_total_w = HSWA_res

HSWA_NEG_quotient = (HSWA_NEG_count / HSWA_total_w)
HSWA_DSJ_quotient = (HSWA_DSJ_count / HSWA_total_w)
HSWA_CNJ_quotient = (HSWA_CNJ_count / HSWA_total_w)
HSWA_MIMP_quotient = (HSWA_MIMP_count / HSWA_total_w)
HSWA_JD_quotient = (HSWA_JD_count / HSWA_total_w)
HSWA_CIMP_quotient = (HSWA_CIMP_count / HSWA_total_w)

HSWA_total_quotient = HSWA_NEG_quotient + HSWA_DSJ_quotient + HSWA_CNJ_quotient + HSWA_MIMP_quotient + HSWA_JD_quotient + HSWA_CIMP_quotient

print("Stats complete")

In [None]:
#testing stats outputs

print (FIA_NEG_quotient)
print (DPA_NEG_quotient)
print (HSWA_NEG_quotient)

print (HSWA_total_w)
print (DPA_total_w)
print (FIA_total_w)

<font size="3">The quotients are used to build the matrix $Q$ and the bar charts.</font>

In [None]:
# importing package 
import matplotlib
import matplotlib.pyplot as plt 
import pandas as pd 
import seaborn as sns
sns.set_style("dark")
  
matplotlib.style.use('ggplot')
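# 'normal' is not a font family name, so matplotlib cannot resolve it;
# use the generic 'sans-serif' family instead
font = {'family': 'sans-serif',
        'weight': 'normal',
        'size': 12}
matplotlib.rc('font', **font)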

# create data: one row per logical connective, one column per statute
df = pd.DataFrame([['(¬ P)', FIA_NEG_quotient, DPA_NEG_quotient, HSWA_NEG_quotient],
                   ['(P ∨ Q)', FIA_DSJ_quotient, DPA_DSJ_quotient, HSWA_DSJ_quotient],
                   ['(P ∧ Q)', FIA_CNJ_quotient, DPA_CNJ_quotient, HSWA_CNJ_quotient],
                   ['(P → Q)', FIA_MIMP_quotient, DPA_MIMP_quotient, HSWA_MIMP_quotient],
                   ['(P ↓ Q)', FIA_JD_quotient, DPA_JD_quotient, HSWA_JD_quotient],
                   ['(P ← Q)', FIA_CIMP_quotient, DPA_CIMP_quotient, HSWA_CIMP_quotient]],
                  columns=['Logical connective', 'FIA', 'DPA', 'HSWA'])
# view data
print(df)

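# pandas plot grouped bar chart
df.plot(x='Logical connective',
        figsize=(10,8),
        kind='bar',
        rot=0,
        stacked=False,
        title='Logical connective quotient by type per legislation')

# Place the legend outside the axes, to the right of the plot
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
# TODO: consider df.plot(grid=True)

# bbox_inches="tight" keeps the outside legend from being clipped in the saved file
plt.savefig("fiadpahswa.pdf", bbox_inches="tight")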

<div align="left"><font size="5"><b>7.2 Significant difference test</b></font></div><a id = "7.2"></a>

In [None]:
# Bar plot of the total logical connective quotient per statute
import numpy as np
import matplotlib.pyplot as plt

data = [("FIA", FIA_total_quotient), ("DPA", DPA_total_quotient), ("HSWA", HSWA_total_quotient)]
names, values = zip(*data)
# names = [x[0] for x in data]  # These two lines are equivalent to the zip call above.
# values = [x[1] for x in data]

ind = np.arange(len(data))  # the x locations for the groups
width = 0.25    # the width of the bars

fig, ax = plt.subplots(figsize=(6,8))
rects = ax.bar(ind, values, width, color='navy', label="PQ quotient")


# add some text for labels, title and axes ticks
ax.set_ylabel('PQ quotient')
ax.set_xlabel('Legislation')
ax.set_title('Logical connective quotient per legislation')
ax.legend()
#ax.legend(loc="upper center", bbox_to_anchor=(0.5, 1.15), ncol=2)
ax.set_xticks(ind)
ax.set_xticklabels(names)




def autolabel(rects):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate(f'{height:0.3f}',
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 0),  # no offset: the label sits directly on top of the bar
                    textcoords="offset points",
                    ha='center', va='bottom')


autolabel(rects)

plt.savefig("fiadpahswa_lcq.pdf")

plt.show()

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

df4 = pd.DataFrame({"legislation":["FIA", "DPA","HSWA"],
                    "quotient":[FIA_total_quotient, DPA_total_quotient, HSWA_total_quotient]})

_, ax = plt.subplots(figsize=(9, 9))
wedges, _, _ = ax.pie(df4["quotient"],
                      labels=df4["legislation"],
                      shadow=False, startangle=90, autopct="%1.1f%%",
                      textprops={'fontsize': 12})
ax.legend(wedges, df4["legislation"], loc='upper right', ncol=1, prop={'size': 10})

ax.set_title("PQ quotient comparison per legislation")

In [None]:
# Friedman test
from scipy.stats import friedmanchisquare

# One sample per statute; position i in every list holds the quotient
# for the same logical connective, so the samples are related, not independent
data1 = [FIA_NEG_quotient, FIA_DSJ_quotient, FIA_CNJ_quotient, FIA_MIMP_quotient, FIA_JD_quotient, FIA_CIMP_quotient]
data2 = [DPA_NEG_quotient, DPA_DSJ_quotient, DPA_CNJ_quotient, DPA_MIMP_quotient, DPA_JD_quotient, DPA_CIMP_quotient]
data3 = [HSWA_NEG_quotient, HSWA_DSJ_quotient, HSWA_CNJ_quotient, HSWA_MIMP_quotient, HSWA_JD_quotient, HSWA_CIMP_quotient]
# compare samples
stat, p = friedmanchisquare(data1, data2, data3)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('No statistically significant difference (fail to reject H0)')
else:
    print('Statistically significant difference (reject H0 in favour of Ha)')
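
<font size="3">The Friedman test ranks the three statutes within each connective (each connective acts as a block) and asks whether the resulting rank sums differ more than chance would allow. A minimal pure-Python sketch of that calculation follows, assuming no correction for ties (`scipy.stats.friedmanchisquare` does apply one, so results can differ when quotients tie). The names `average_ranks`, `friedman_statistic`, `p_value_df2` and the sample lists are hypothetical and introduced here only for illustration.</font>

```python
import math


def average_ranks(values):
    """Rank values from 1 (smallest) upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j hold ranks i+1..j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(*groups):
    """Friedman chi-square statistic without the correction for ties."""
    k = len(groups)       # number of groups (statutes)
    n = len(groups[0])    # number of blocks (connectives)
    rank_sums = [0.0] * k
    for block in zip(*groups):
        for g, rank in enumerate(average_ranks(list(block))):
            rank_sums[g] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


def p_value_df2(stat):
    """Chi-square survival function, valid only for 2 degrees of freedom."""
    return math.exp(-stat / 2.0)


# Illustrative values only: the third group is largest in every block
g1 = [1, 2, 3, 4, 5, 6]
g2 = [2, 3, 4, 5, 6, 7]
g3 = [3, 4, 5, 6, 7, 8]
stat = friedman_statistic(g1, g2, g3)
p = p_value_df2(stat)
```

<font size="3">With three groups the statistic has $k - 1 = 2$ degrees of freedom, for which the chi-square tail probability is $e^{-\chi^2/2}$. A statistic of 1.0 therefore corresponds to $p = e^{-0.5} \approx 0.607$.</font>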

<div align="left"><font size="5"><b>7.3 Conclusion</b></font></div><a id = "7.3"></a>
<font size="3">We fail to reject $H_{0}$: the Friedman test finds no statistically significant difference in NER $L_{1}$ quotients among the three statutes, $\chi^2(2, N = 6) = 1.0$, $p > \alpha$ with $\alpha = 0.05$.