# Natural Language Processing on a UK Parliamentary Debate Transcript
### Analysing the sentiment arising in a debate about Scottish Independence
By Yente Meijers
<pre style="font-size: 18px; font-family: Times; text-align: left;
    white-space: pre-line;">
In this lab report, I will analyse the sentiments and the topics that arise in a parliamentary debate surrounding <b>Scottish Independence and the Scottish Economy</b> in the British House of Commons. Scottish independence has been a big political topic in the United Kingdom (UK) ever since the Scottish National Party (SNP) rose to power in the early 2000s with the promise of an independence referendum (Broun, 2013).

Understanding the topics that arise when the issue of Scottish independence is debated as well as the sentiment expressed by different Members of Parliament (MPs) some of whom may belong to the SNP and other might belong to the Labour or Conservative parties, is of great importance for comprehending the wider political discussion on this topic (Beasley et al., 2016). The results of the analysis can help researchers and the public understand what stakes are involved in this issue for different parties.

The debate analysed in this report took place on the 2<sup>nd</sup> of November 2022 in the House of Commons, which was before the British supreme court ruled that a Scottish Independence Referendum would be unlawful. This ruling makes it especially interesting to look at what preceded this decision in the debates in the parliament. 

This report uses computational <b>content analysis</b>, which is the "systematic, objective, quantitative analysis of message characteristics" (Neuendorf, 2017, p. 2) but then automated with the help of computational methods instead of human analysis to gain a deeper understanding of what is discussed and with what sentiment in the debate about Scottish Indendence. To analyse larger amounts of data, I will use <b>Natural Language Processing (NLP)</b>, which allows researchers scale up how much textual data can be analysed (Kedia & Rasu, 2020).

The Hansard Parliamentary Debate Transcripts are a tremendous data source as they contain multitudes of information about the political debates happening in the United Kingdom. They are also free to access <a href="https://hansard.parliament.uk/">online</a> and published very soon after the debates take place. For data-processing purposes, it is also convenient that all of the transcipts come in the same .txt formatting with similar structures of white lines, making it easier to parse the raw text and extract information about the speakers.

The transcripts have been used in Natural Language Processing before, most notably in several papers by Abercrombie and Batista-Navarro (2018, 2019, 2020, 2020a). Abercrombie and Batista-Navarro used a BERT-based model to perform NLP on analyse British parliamentary debate transcripts. They have done different types of analysis on the transcripts, including sentiment analysis.


<i><b>Important note</b>: In order to run all the code cells in this Jupyter notebook, the user must ensure that it is in the same environment as the additional materials such as the data file and additional Python scripts used for data pre-processing. All the necessary files can be found in this <b><a href="https://github.com/a-kell/SIMM71_NLP">GitHub repository</a></b>. 
The necessary scripts are:</i> 
 • custom_types.py
 • parse_raw_files.py
 • speaker_analyse.py
<i>The data is called:</i>
 • scotdebate.txt
 <i>For installing the required packages you also need:</i>
 • requirements.txt
</pre>

In [1]:
# Install the required packages
!pip install -r requirements.txt

# Load required packages
from collections import defaultdict, Counter
from enum import Enum
from pathlib import Path
from pprint import pp
from custom_types import EnhancedNode
from parse_raw_file import parse_file
from statistics import mean

import pyLDAvis
import pyLDAvis.sklearn
from sklearn.feature_extraction import text 
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

from transformers import pipeline
from speaker_analyse import speaker_cache
from collections import defaultdict, Counter
import pandas as pd

import plotly.io as pio
import plotly.express as px



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\yente\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
  from imp import reload


## Data Pre-Processing and Cleaning
<pre style="font-size: 18px; font-family: Times;  text-align: left;
    white-space: pre-line;">
The debate transcripts are only available in a .txt file, with no rows or columns, so the data is basically one big block of text with some white lines. For the purpose of analysis, I want to extract the <b>title</b> of the debate, the <b>time</b> of the debate, the <b>speaker</b> and <b>information about the speaker</b>. So, before I start any NLP, I am going to process the debate text so that it is in the format of speech by speaker, with metadata about each speaker available as well.

Due to the extensive nature of the data pre-processing and cleaning required to get the debate transcript in the correct format for NLP, I will call several Python scripts in the notebook that should also be available in the same environment. These scripts are also annoted to explain clearly what happens in each step but I will give a brief overview below.</pre>

<h3 style="margin-left: 100px; font-family: Times; font-size: 20px">1. Parsing the text</h3>
        <pre style="font-size: 18px; font-family: Times; white-space: pre-wrap;">
                In the first step, I want to run through the data to find the lines that contain information that I want to extract. First, I extract the title of the debate, which is always on the first line. Then, I remove the time stamps from the transcripts because this information is irrelevant for my analysis. 
        </pre>

<h3 style="margin-left: 100px; font-family: Times; font-size: 20px">2. Finding the speaker</h3>
        <pre style="font-size: 18px; font-family: Times; white-space: pre-wrap;">
                After that, I extracted the lines which contain a speaker and saved that information seperately. 
        </pre>

<h3 style="margin-left: 100px; font-family: Times; font-size: 20px">3. Adding the speaker characteristics</h3>
        <pre style="font-size: 18px; font-family: Times; white-space: pre-wrap;">
                The speaker line contains more information about the speaker besides their name, it also states which region in the UK they represent and which party they belong to. All of this information is also stored.
        </pre>

<h3 style="margin-left: 100px; font-family: Times; font-size: 20px">4. Creating a corpus</h3>
        <pre style="font-size: 18px; font-family: Times; white-space: pre-wrap;">
                To analyse the text, I put it in multiple different formats. For the summary statistics, I divide the text by speech per speaker and I also count the number of speakers associated with each party. In this way, I can then also calculate the average length of the speeches in words. For the topic model, I add all the text together with the speaker names removed to get the topics for the entire debate, not just by speaker. For the sentiment analysis, I add all the speeches for one speaker together and I break the speeches up into sentences with the nltk package, which allows handling punctuation, contractions etc. correctly. 
        </pre>

In [2]:
# Create filepath for the textfile containing the debate text
filepath = Path("scotdebate.txt")

# Runs all the pre-processing code in the Python scripts
title, results = parse_file(filepath)

len(enhanced_results) = 343


## Summary Statistics
<pre style="font-size: 18px; font-family: Times;  text-align: left;
    white-space: pre-line;">
To get a better understanding of the parliamentary debate, I will first perform some simple descriptive statistics about the speakers to get a general sense of the debate. I will also do some simple <b>topic modelling</b> to show what kind of things tend to come up when Scottish independence is debated in the British parliament. All of this is helpful to then better understand the results of the sentiment analysis. </pre>

### Speaker information and average speech length
<pre style="font-size: 18px; font-family: Times;  text-align: left;
    white-space: pre-line;">
To get an idea about the Members of Parliament (MPs) participating in the debate, I will calculate how many speakers there are as well as which party they belong to and how long they tend to speak  on average. 
</pre>

In [3]:
class GroupBy(Enum):
    Speaker = "Speaker"
    Party = "Party"
    All = "All"


def group_text(results: list[EnhancedNode], groupby: GroupBy) -> dict[str, list[str]]:
    # Takes a list of result nodes, and what to group the text by, and returns a dictionary.
    ## Keys are the groupby value, speaker, party or all for all text. And the values are lists of text

    output = defaultdict(list)

    if groupby == GroupBy.All:
        for result in results:
            output["All"].extend(result.text)
    elif groupby == GroupBy.Party:
        for result in results:
            output[result.party].extend(result.text)
    elif groupby == GroupBy.Speaker:
        for result in results:
            output[result.speaker].extend(result.text)

    return output


def get_speech_length_in_words(result: EnhancedNode) -> int:

    raw_text = "".join(result.text)
    words = raw_text.split(" ")

    return len(words)


def print_summary_stats(results: list[EnhancedNode]):

    #  Speaker counts
    speakers: Counter[str] = Counter()
    for result in results:
        speakers[result.speaker] += 1

    # Party counts
    parties: Counter[str] = Counter()
    already_seen_speakers = set()
    for result in results:

        if result.speaker not in already_seen_speakers:
            already_seen_speakers.add(result.speaker)
            parties[result.party] += 1

    # Number of speakers
    number_of_speakers = len(speakers.keys())
    number_of_speeches = sum(speakers.values())
    print(f"{number_of_speakers = }")
    print(f"{number_of_speeches = }")

    # Party alignment
    for party, count in parties.most_common():
        print(f"{party = } {count = }")

    # Speaker frequency
    for speaker, count in speakers.most_common():
        print(f"{speaker = } {count = } frequency = {count / number_of_speeches :2f}")

    # Speech length
    speech_word_counts = [get_speech_length_in_words(x) for x in results]

    print(f"{mean(speech_word_counts) = :.0f}")


if __name__ == "__main__":
    # pp(group_text(results, GroupBy.All))
    # pp(group_text(results, GroupBy.Party))
    # group_text(results, GroupBy.Speaker)

    print_summary_stats(results)

number_of_speakers = 57
number_of_speeches = 343
party = 'SNP' count = 30
party = '' count = 9
party = 'Lab' count = 6
party = 'Con' count = 5
party = 'LD' count = 4
party = 'Alba' count = 2
party = 'DUP' count = 1
speaker = 'Ian Blackford' count = 38 frequency = 0.110787
speaker = 'Ian Murray' count = 37 frequency = 0.107872
speaker = 'Angus Brendan MacNeil' count = 25 frequency = 0.072886
speaker = 'Mr Jack' count = 24 frequency = 0.069971
speaker = 'Pete Wishart' count = 21 frequency = 0.061224
speaker = 'Mr Perkins' count = 18 frequency = 0.052478
speaker = 'David Duguid' count = 15 frequency = 0.043732
speaker = 'Robin Millar' count = 13 frequency = 0.037901
speaker = 'Mr Nigel Evans' count = 11 frequency = 0.032070
speaker = 'Martin Docherty-Hughes' count = 10 frequency = 0.029155
speaker = 'Steven Bonnar' count = 7 frequency = 0.020408
speaker = 'Kirsty Blackman' count = 7 frequency = 0.020408
speaker = 'Christine Jardine' count = 6 frequency = 0.017493
speaker = 'Neale Hanvey' 

|  **Number of Speakers**    | 57 |
| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------ |
| **Number of Speeches**                      | **343** |
| **Average Speech Length**                      | **155 words** |

<pre style="font-size: 18px; font-family: Times;  text-align: left;
    white-space: pre-line;">
There are a total of 57 speakers in the dataset, including 'Mr. Speaker' and 'Madam Deputy Speaker' but excluding 'Hon. Members'. There are many speakers and speeches, which again confirms the need for NLP to analyse parliamentary transcripts due to the large quantities of text and information to process. The output above also shows how often each MP in volved in the debate spoke but because there are so many speakers, this is not a useful metric to include for all the speakers. We can identify which MPs are the top speakers: Ian Blackford and Ian Murray dominated the debate. Blackford belongs to the SNP, so it makes a lot of sense that he was speaking up a lot. Murray belongs to the Labour party and represents south Edinburgh, a prosperous region that has an interst in maintaining good English-Scottish relations and has never elected an MP from the SNP.
</pre>

| Party   | Number of MPs    |
| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------ |
| **Scottish National Party**       | 30   |
| **Conservative**  | 5    |
| **Labour**                | 6 |
| **Liberal Democrat**    | 4   |
| **Alba Party**                     | 2   |
| **Democratic Unionist Party (Northern Ireland)** | 1     |
| **Not Specified** | 9    |

<pre style="font-size: 18px; font-family: Times;  text-align: left;
    white-space: pre-line;">
Above you can see how many speakers belong to which party. As you can see, most of the participants of the debate belong to the SNP, which makes considering the fact that Scottish independence is their main mission.The Alba party is also a Scotland-only party that is pro-independence. Labour, Conservative and Liberal Democrat are the major UK-wide parties who generally oppose Scottish independence. The Democratic Unionist Party is a Northern Irish party that is pro-UK and by extension anti-independence. The speakers in the debate that do not have a party affiliation are those who are involved in the debate as e.g. Mr Speaker or Madam Deputy Speaker.
</pre>

### Topic Modelling
<pre style="font-size: 18px; font-family: Times;  text-align: left;
    white-space: pre-line;">
To get an idea of what topics come up the most in the debate about Scottish independence, topic modelling is a useful technique. Topic modelling groups texts based on the words and the probability of a word belonging to a certain topic. Topic modelling can be done with several different algorithms and I will use <b>latent dirichlet allocation (LDA)</b>, which is used most commonly in scientific research (Egger, 2022). 

To get better results, it is important to remove stopwords and change all the words to lowercase.  In this way, the topic model has cleaner text to process. It is also important to get the TF-IDF value of the word, which indicates the relative importance of the word in relation to the text. I will run the topic model on the entire text, so not by speaker, to get a more general idea about the debate. I will run the topic model on three dimensions, which means that I will get the output for three topics. 
</pre>

In [4]:
# Save the corpus to be used for the topic model
debate_corpus = group_text(results, GroupBy.All)['All']

# Pre-processing
## Adding my own stop words
my_additional_stop_words = ["hon", "way", "right", "member", "gentleman", "speaker", 
                            "say", "friend", "just", "said", "did", "let" "think", "want", "make"]
stop_words = list(text.ENGLISH_STOP_WORDS.union(my_additional_stop_words))

## Make everything lowercase and remove stopwords
tf_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                stop_words = stop_words,
                                lowercase = True)

## Get TF values 
dtm_tf = tf_vectorizer.fit_transform(debate_corpus)

## Get TF-IDF values 
tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())
dtm_tfidf = tfidf_vectorizer.fit_transform(debate_corpus)

# Creating the LDA topic model
## for TF DTM
lda_tf = LatentDirichletAllocation(n_components=3, random_state=6, max_iter=100)
lda_tf.fit(dtm_tf)
## for TFIDF DTM
lda_tfidf = LatentDirichletAllocation(n_components=3, random_state=6, max_iter=100)
lda_tfidf.fit(dtm_tfidf)

# Visualise the model
pyLDAvis.enable_notebook()
pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer)

  by='saliency', ascending=False).head(R).drop('saliency', 1)


### Interpreting the topics
<pre style="font-size: 18px; font-family: Times;  text-align: left;
    white-space: pre-line;">
Based on the results from the LDA topic model, we can infer three different topics that appear to be the most prevalent in the debate about Scottish independence. The first topic includes words mostly to do with Scotland, people and independent, which indicates it relates to the matter of Scottish <b>identity</b>. The second topic contains words such as economy, poverty and energy, which means that this topic mostly concerns the question of the Scottish <b>economy</b>. The third topic again contains Scottish, SNP, referendum and independence, which means that it more directly relates to the matter of <b>independence</b> from the UK.

In the graph above, you can see that the topics do not overlap, which is good. The topics are similar in size, all containing about a third of the total tokens of the debate. 

As the topic model is just meant as exploratory analysis, I did not test the cohesion of my topics and the number of topics is based on the type of words that were included in the topics. I reran the topic model again after adding some of my own stop words that appeared to be cluttering the topics.
</pre>

## Sentiment Analysis
<pre style="font-size: 18px; font-family: Times;  text-align: left;
    white-space: pre-line;">
To further analyse the debate text, I will use sentiment analysis. Sentiment analysis is "a set of algorithms and techniques used to detect the sentiment (positive, negative, or neutral) of a given text" (Kedia & Rasu, 2020, p. 14). At this time, the most sophisticated way to do sentiment analysis by using transformer based model, which work better than a model such as Naive Bayes, which does not look at word sequences. Using Naive Bayes would also involve manually labelling, which defeats part of the purpose of automating content analysis. 
Huggingface's (2022) transformers is a pre-trained machine learning and Bidirectional Encoder Representations from Transformers(BERT)-based (Devlin et al., 2019) sentiment analysis tool, trained on massive amounts of language data. Packages like TextBlob, which is trained on PatternAnalyzer (based on the pattern library) and NaiveBayesAnalyzer (an NLTK classifier trained on a movie reviews corpus), or Vader, which is trained specifically for social media use, are thus far inferior to a model that is BERT-based and trained on a lot more data.

I will run the sentiment analysis per party to get an understanding of how different parties in the debate talk about Scottish independence. I chose to use DistilBERT, which is one of the pre-trained machine learning models included in Huggingface's transformers package. DistilBERT is a distilled version of BERT: smaller, faster, cheaper and lighter, reducing the size of the model to 40% but still getting 97% of the same results as the complete BERT (Huggingface, 2022).
</pre>

In [5]:
# Defining the NLP method to be used
classifier = pipeline("sentiment-analysis")
## The default model is DistilBERT, which is what I want to use

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [6]:
speaker_corpus = group_text(results, GroupBy.Speaker)

party_sentiments = defaultdict(list)

for speaker, sentences in speaker_corpus.items():

    # 1 string for all text by a speaker
    speaker_info = speaker_cache[speaker]
    sentiments = classifier(sentences)

    sentiment_labels: list[str] = [sentiment["label"] for sentiment in sentiments] # type: ignore
    party_sentiments[speaker_info.party].extend(sentiment_labels) 

# Rename the no party affiliation
party_sentiments["No Party"] = party_sentiments[""]
del party_sentiments[""]

In [7]:
# Get counts per sentiment type by party
party_positive_counts = Counter()
party_negative_counts = Counter() 


for party, labels in party_sentiments.items():

    for label in labels:
        if label == "POSITIVE":
            party_positive_counts[party] += 1
        elif label == "NEGATIVE":
            party_negative_counts[party] += 1
        else:
            raise ValueError("Unknown sentiment label")


print(party_positive_counts.most_common())
print(party_negative_counts.most_common())


party_positive_portion = {}

for party in party_sentiments.keys():
    party_positive_portion[party] = party_positive_counts[party] / (party_positive_counts[party] + party_negative_counts[party])

print(party_positive_portion.items())

df_positive = pd.DataFrame(party_positive_counts.items())
df_positive.columns = ["Party", "Count"]
df_positive["Label"] = "Positive"

df_negative = pd.DataFrame(party_negative_counts.items())
df_negative.columns = ["Party", "Count"]
df_negative["Label"] = "Negative"

graph_data = pd.concat((df_positive, df_negative))
graph_data.head()

[('SNP', 964), ('No Party', 267), ('Lab', 167), ('Con', 115), ('LD', 46), ('Alba', 31), ('DUP', 3)]
[('SNP', 832), ('Lab', 220), ('No Party', 169), ('Con', 91), ('LD', 53), ('Alba', 50), ('DUP', 2)]
dict_items([('SNP', 0.5367483296213809), ('Lab', 0.4315245478036176), ('LD', 0.46464646464646464), ('Con', 0.558252427184466), ('Alba', 0.38271604938271603), ('DUP', 0.6), ('No Party', 0.6123853211009175)])


Unnamed: 0,Party,Count,Label
0,SNP,964,Positive
1,Lab,167,Positive
2,LD,46,Positive
3,Con,115,Positive
4,Alba,31,Positive


## Sentiment visualisation
<pre style="font-size: 18px; font-family: Times;  text-align: left;
    white-space: pre-line;">
Now, to get a better sense of the sentiment per party, I will visualise the average sentiment of speeches per party.
</pre>

In [8]:
# Graph with counts per sentiment

# Set plotly rendering and templates defaults for the graphs
pio.renderers.default = "jupyterlab" # Make sure that the renderer is set to the correct platform that the user is running the notebook in
pio.templates.default = "plotly_white"


fig = px.bar(graph_data, x="Party", y="Count", color="Label", 
    title="Counts of Positive and Negative Sentiment per Party",)
fig.update_layout(font={"size":16})
fig.show()


distutils Version classes are deprecated. Use packaging.version instead.


distutils Version classes are deprecated. Use packaging.version instead.



In [9]:
# Graph with proportions

## Preparing the proportions for the graph
graph_data_prop = pd.DataFrame(party_positive_portion.items())
graph_data_prop.columns = ["Party", "Positive Sentiment"]
graph_data_prop["Negative Sentiment"] = 1 - graph_data_prop["Positive Sentiment"]
graph_data_prop.head()

## making
fig2 = px.bar(graph_data_prop, x="Party", y=["Positive Sentiment", "Negative Sentiment"], 
    title="Proportion of Positive and Negative Sentiment per Party")
fig2.update_yaxes(title="Proportion")
fig2.update_layout(legend_title="", font={"size":16})
fig2.show()


distutils Version classes are deprecated. Use packaging.version instead.


distutils Version classes are deprecated. Use packaging.version instead.



## Results
<pre style="font-size: 18px; font-family: Times;  text-align: left;
    white-space: pre-line;">
The results of the sentiment analysis show that, on average, Members of Parliament involved in the debate about Scottish Independence are slightly more negative in their speech sentiment. When looking at the sentiment per party, it becomes clear that there is a difference depending on which party the speaker is aligned with.

Because the sentiment is assigned per speech and not towards specific entities, it is hard to interpret if negative sentiment implies that the speech was negative towards Scottish independence or towards for example the UK. It could be that the Alba Party MPs, which has the highest proportion of negative sentiment in their speeches, just used more negative framing to promote independence rather than being anti-independence, which would not make sense for this pro-independence party. 

Therefore, the results of the sentiment analysis can mostly be used to get a general idea of what the debate surrounding Scottish independence is like in the House of Commons and if certain parties use more negative sentiments in their speeches surrounding this matter, be it against the UK or against independence, which would both count as negative sentiment.

Interestingly, the SNP has a higher percentage of positive sentiment (54%) in their speeches than the Alba party (38%), both of which are pro-independence. Because of their public support for Scottish independence, it can be assumed that they are speaking pro-independence but based on the sentiment we can hypothesise that the Alba party uses more negative framing, potentially anti-union and anti-UK language, rather than pro-Scotland language.

There does not appear to be a difference in sentiment related to whether a party is left or right. It is interesting that the Conservative party appears to express more positive sentiment (56%) than the left-leaning Labour party (43%) but again, this does not indicate whether these parties were speaking negatively or positively about independence but rather concerns the general sentiment of their speeches. The SNP is neither right nor left but very much positioned in the center and thus Scottish independence is not put forward as a left-rigth ideological matter. 

</pre>

### Discussion
<pre style="font-size: 18px; font-family: Times;  text-align: left;
    white-space: pre-line;">
One important thing to keep in mind is that the sentiment analysis was performed on a debate transcript, which is thus spoken word. In general, sentiment analysis is more common to do on written texts rather than speeches written down as text. Also, the SNP very much dominated this debate, which means that the inferences based on the speeches from MPs belonging to other parties are not as generalisable.
</pre>

## References
<pre style="font-size: 18px; font-family: Times;  text-align: left;
    white-space: pre-line;">
Abercrombie, G. & Batista-Navarro, R. (2018) ‘Identifying Opinion-Topics and Polarity of Parliamentary Debate Motions’, in <i>Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis</i>, pp. 280–285. doi: 10.18653/v1/P17.

Abercrombie, G. & Batista-Navarro, R. (2018) ‘“Aye” or “No”? Speech-level Sentiment Analysis of Hansard UK Parliamentary Debate Transcripts’, in <i>Proceedings of the Eleventh International Conference on Language Resources and Evaluation</i>.

Abercrombie, G., Batista-Navarro, R. Nanni, F. & Ponzetto, S. P. (2019) ‘Policy Preference Detection in Parliamentary Debate Motions’, in <i>Proceedings of the 23rd Conference on Computational Natural Language Learning</i>, pp. 249–259. Available at: <a>https://www.publicwhip.org.uk (Accessed: 6 November 2022)</a>.

Abercrombie, G. & Batista-Navarro, R. (2020) ‘ParlVote: A Corpus for Sentiment Analysis of Political Debates’, in <i>Proceedings of the 12th Conference on Language Resources and Evaluation</i>, pp. 11–16. doi: 10.17632/czjfwgs9tm.1.

Abercrombie, G. & Batista-Navarro, R. (2020) ‘Sentiment and position-taking analysis of parliamentary debates: a systematic literature review’, <i>Journal of Computational Social Science</i>, 3, pp. 245–270. doi: 10.1007/s42001-019-00060-w.

Beasley, R., Kaarbo, J. and Solomon-Strauss, H., (2016). 'To be or not to be a state? Role contestation in the debate over Scottish independence'. In <i>Domestic role contestation, foreign policy, and international relations</i> (pp. 156-172). Routledge.

Broun, D. (2013). Scottish independence and the idea of Britain. Edinburgh: Edinburgh University Press.

Devlin, J. et al. (2019) ‘BERT: Pre-training of deep bidirectional transformers for language understanding’, <i>NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference</i>, 1, pp. 4171–4186.

Egger, R. (2022). ‘Topic Modelling.’ In: Egger, R. (eds) <i>Applied Data Science in Tourism. Tourism on the Verge</i>. Springer, Cham. <a>https://doi.org/10.1007/978-3-030-88389-8_18</a>

Huggingface (2022) 🤗 Transformers. Available at: <a>https://huggingface.co/docs/transformers/index</a> (Accessed: 15 December 2022).

Kedia, A. & Rasu, M. (2020) <i>Hands-On Python Natural Explore tools and techniques to analyze and process text with</i>. Packt Publishing, Ltd.

Neuendorf, K. A. (2017) <i>The Content Analysis Guidebook</i>. Thousand Oaks, California: SAGE Publications. doi: 10.5260/chara.19.4.38.

Sharma, H.  (2021) <i>Topic Model Visualization using pyLDAvis</i>, Towards Data Science. Published 05 June 2021. Available at: <a>https://towardsdatascience.com/topic-model-visualization-using-pyldavis-fecd7c18fbf6</a> (Accessed: 15 December 2022).
</pre>