# Abstract

Observed regional variation in geotagged social media text is often attributed to dialects, where features in language are assumed to exhibit region-specific properties. While dialects are seen as a key component in defining the identity of regions, there are a multitude of other geographic properties that may be captured within natural language text. In our work, we consider locational mentions that are directly embedded within comments on the social media website Reddit, providing a range of associated semantic information, and enabling deeper representations between locations to be captured. Using a large corpus of geoparsed Reddit comments from UK-related local discussion subreddits, we first extract embedded semantic information using a large language model, aggregated into local authority districts, representing the semantic footprint of these regions. These footprints broadly exhibit spatial autocorrelation, with clusters that conform with the national borders of Wales and Scotland. London, Wales, and Scotland also demonstrate notably different semantic footprints compared with the rest of Great Britain.


In [1]:
import warnings

import polars as pl
import pandas as pd
import matplotlib.pyplot as plt

from src.common.utils import Paths, process_outs

warnings.filterwarnings("ignore")

plt.rcParams.update(
    {
        "font.size": 6,
        "text.usetex": False,
        "font.family": "sans-serif",
        "font.sans-serif": ["DejaVu Sans"],
    }
)

places, regions, lad, region_embeddings, lad_embeddings = process_outs()

# Introduction

The prevalence of social media data for use in geographic research has generated a renewed interest in the concept of 'place' [@wagner2020;@purves2019;@westerholt2018a], as contributions to social media are theorised to capture informal knowledge that represents a place-based understanding of geography [@goodchild2011;@sui2011]. In the context of language, this place-based knowledge is generated through 'vernacular geography', which describes the natural language used when informally describing geographic locations [@gao2017a;@goodchild2011;@waters2003;@hollenstein2008]. This informal knowledge incorporates biases regarding locations, better representing human perceptions of geography, compared with formal administrative definitions. In this sense, associations of geography drawn from social media capture place through a 'bottom-up' approach, building knowledge through experience rather than administrative formalisations [@agnew2005;@sui2011]. While many works have considered the formalisation of place through geotagged social media data, few have considered how the semantic properties of text may reveal geographic heterogeneity between regions, generated directly through vernacular geography. The components of vernacular geography are closely coupled with the identity of regions, where culture, topics, and general perceptions are captured through the language associated with locational mentions in text [@paasi2003;@buttimer2015].

A multitude of works have considered the geographic variation in geotagged social media text [@russ2012;@doyle2014;@huang2016;@goncalves2014;@perez2019;@arthur2019;@eisenstein2014], focussing primarily on how dialect variation is captured through differences in the vocabulary (lexicons) of contributors over geographic space. For example, Tweet lexicons originating in the North East of England differ noticeably from those originating in the South [@arthur2019]. While dialects do demonstrate geographic heterogeneity, they present only one component of language that may exhibit geographic variation and do not directly contribute properties associated with vernacular geography. This limitation stems primarily from the reliance of these works on geotagged social media, where the textual content rarely relates to the geotagged location [@kropczynski2018], meaning dialects are the only explainable trait that results in geographic heterogeneity.

In our work, we instead consider the ability to compare the geographic variation in semantic information relating to locational mentions embedded directly within social media text. This approach means that instead of solely focussing on dialects, our semantic differences capture a broad range of associations between locations, contributed by the vernacular geography of users. While a lexical approach explores the vocabulary of a language, we instead generate sentence embeddings using new developments in natural language processing, which enable nuanced semantic information to be numerically represented [@devlin2019]. Unlike simple lexical representations, sentence embeddings capture contextual semantic information [@hu2020]. While general topics of discussion are shared between locations, semantic representations are capable of capturing the differing context in which they are mentioned. For example, 'restaurants' are frequently discussed in location forums, but the way they are discussed is influenced by the distinctive culture of each location.

We name these representations the 'semantic footprints' of locations, capturing semantic traces relating to locations, contributed by individuals through a subset of their digital footprints [@walden-schreiner2018]. We then analyse these semantic footprints to determine whether they exhibit spatial autocorrelation or geographically cohesive clustering. To generate an explainable characteristic of these footprints, we then explore whether generated national identities of location-associated text correlate with regions where footprints appear more semantically isolated. To achieve this, we utilise the emergent properties of large language models (LLMs), where a task known as zero-shot classification enables models to assign labels to text without any annotated training data. We query an LLM to attribute a specific sub-nationality within the United Kingdom to each of our comments and explore whether the varying strength of these nationalities correlates with differences in our semantic footprints.

@sec-literature first gives an overview of work exploring semantic variation in social media text, regional identities, and how our approach differs from related work. @sec-methodology describes our data, then outlines the processing used to generate semantic footprints and describes our geographic analysis of these footprints. @sec-results presents our results and @sec-conclusion concludes with suggestions for future work.

# Geographic Variation in Social Media Text {#sec-literature}

While formal geographic regions within Great Britain are typically designed for administrative and political purposes, they are non-restrictive in how populations can move between them. The level of geographic cohesion between regions across Great Britain is often studied from the context of mobility, where data sources like Census or transport records describe the physical movement of populations and individuals across geographic space [@rae2009;@titheridge2009], or through non-physical networks using phone records [@sobolevsky2013;@reades2009;@zheng2015;@lambiotte2008], and social media [@lengyel2015;@arthur2019;@sui2011]. When these networks are examined, cohesive clusters develop, which broadly appear to correlate with administrative boundaries [@arthur2019;@ratti2010].

Alternatively, many works have taken advantage of the abundance of geotagged social media text, to examine regional differences in dialects [@huang2016;@eisenstein2014;@goncalves2014;@arthur2019;@russ2012;@han2012;@doyle2014;@zheng2018]. Many of these works have noted that, like online or physical networks, geographically cohesive properties emerge, which appear to correlate with administrative boundaries [@huang2016;@eisenstein2014;@goncalves2014;@arthur2019]. These results conform with the idea that dialects are an important component in the identity of regions [@haesly2005;@llamas2014;@llamas2009]. Despite this, dialects only present a single component of language that contributes to a sense of geographic identity between regions [@middleton2008;@haesly2005], ignoring the wealth of vernacular geography that may also be captured in text [@evans2007;@sui2011;@berragan2023a]. 

Studies that consider dialect variation in social media text treat geotags as the only geographically relatable feature of this data source. Given social media communication comprises a broad range of topics that do not necessarily relate to locational discussion, these geotags and associated text are unlikely to be directly related. Any observed regional variation is therefore only attributable to the dialect of the contributing author, with the assumption that the author is a resident in the geotagged location. In contrast to this approach, locational mentions embedded directly within text present an alternative method to explore how the language regarding locations varies geographically. Place names embedded directly within text can also be related to the surrounding context of their use, capturing the vernacular geography of contributing users [@sui2011;@evans2007]. Lexicons associated with locations identified in this manner therefore incorporate a broad range of topics, associations, and cultural information, rather than solely dialects, more broadly capturing the components of language that contribute to the identity of locations [@haesly2005]. In our work, we therefore extract place names from a collection of UK-specific comments taken from the social media website Reddit, attributing coordinate information through a process called geoparsing [@purves2019], allowing us to explore the geographic heterogeneity of text associated with identified locations.

While past works have primarily considered the statistical comparison between location-based lexicons, where word counts are associated with aggregate regions generated through geotagged Tweets, this approach is limited when considering the more nuanced semantic variations in vernacular geography. Recent progress in natural language processing has led to the development of large language models (LLMs) which are able to capture deep contextual semantic information from text, through sentence and word embeddings [@devlin2019]. Unlike a lexical approach, where word order and semantic information are not captured, these embeddings act as numerical representations of text which incorporate contextual semantic information in depth. Embeddings that are more semantically similar are closer together in their embedding space, meaning that, like lexicons, these embeddings may be statistically compared. We therefore generate sentence embeddings for each comment in our corpus that contains a place name, which are then aggregated by location, forming what we call a semantic footprint. These footprints represent the collective geographic knowledge of each individual user in our corpus, built through their vernacular geography, capturing informal, place-based information through their perception of geoparsed locations [@sui2011;@goodchild2011].

In this work, we generate a new comparative measure between regions in the UK through an examination of text associated with locations, extracted from comments on the social media website Reddit. While past work has examined variation between regions from the perspective of social media networks, or by examining lexicons associated with geotagged social media messages, we examine regional variations derived from geoparsed embeddings generated from a large language model. Unlike using geotags, which ascribe linguistic features such as dialect to specific locations, our method instead captures any comment that mentions a location alongside its semantic context. Quantified information therefore does not reflect dialects associated with locations, but common semantic associations, embedding cultural information, or location-specific topics and opinions. Given users mentioning locations are not necessarily residents, these semantic associations represent a collective informal geographic knowledge generated through the vernacular geography of people across the UK, embedding their general semantic footprint.

# Methodology {#sec-methodology}

The following section first introduces our main data source, the social media website Reddit, from which we access a collection of user-submitted comments. Following this, we detail our methodology for generating semantic footprints from each of these comments, and how we analyse the geographic properties of these footprints.


In [2]:
all_comments = (
    pl.scan_parquet(Paths.RAW / "comments_combined-2023_02_23.parquet")
    .select(
        [
            pl.col("score").count().alias("count"),
            pl.col("author").n_unique().alias("n_authors"),
            pl.col("created_utc")
            .min()
            .alias("first_utc")
            .apply(lambda s: s.strftime("%Y-%m-%d")),
            pl.col("created_utc")
            .max()
            .alias("last_utc")
            .apply(lambda s: s.strftime("%Y-%m-%d")),
            pl.col("text").str.split(by=" ").explode().len().alias("total_words"),
        ]
    )
    .collect()
)
# Use `group_by` here as well, matching the calls further down in this cell.
places_count = places.filter(pl.col("author") != "deleted").group_by("author").count()
nnp = int(places_count.quantile(0.99)["count"][0])
ldn = len(
    pl.scan_parquet(Paths.RAW / "places-2023_04_11.parquet")
    .filter(pl.col("word") == "london")
    .collect()
)
single = len(
    pl.scan_parquet(Paths.RAW / "places-2023_04_11.parquet")
    .group_by(pl.col("word"))
    .count()
    .filter(pl.col("count") == 1)
    .collect()
)
unq = len(
    pl.scan_parquet(Paths.RAW / "places-2023_04_11.parquet")
    .group_by(pl.col("word"))
    .count()
    .collect()
)

## Data

[Reddit](https://reddit.com) is a public discussion and news aggregation social network, and is among the top 20 most visited websites in the United Kingdom. In 2020, Reddit had around 430 million active monthly users, comparable to the number of Twitter^[Now known as [X](https://x.com)] users [@murphy2019;@statista2022]. Reddit is divided into separate independent _subreddits_ each with specific topics of discussion, where _users_ may submit _posts_ which each have dedicated nested conversation threads that users can add _comments_ to. Subreddits cover a wide range of topics, and in the interest of geography, they also act as forums for the discussion of local places. The [United Kingdom subreddit](https://reddit.com/r/unitedkingdom) acts as a general hub for related topics, notably including a list of smaller and more specific related subreddits. This list provides a 'Places' section, a collection of local British subreddits, ranging in scale from country (`/r/England`), region (`/r/thenorth`, `/r/Teeside`), to cities (`/r/Manchester`) and small towns (`/r/Alnwick`). In total there are 213 subreddits that relate to 'places' within the United Kingdom^[https://www.reddit.com/r/unitedkingdom/wiki/british_subreddits]. We use the corpus generated by _anonymised_, which consists of a collection of all Reddit comments taken from each UK-related subreddit [@baumgartner2020], with place names identified by a custom transformer-based named entity recognition model^[anonymised link]. In total `{python} f"{all_comments['count'][0]:,}"` comments were extracted, submitted by `{python} f"{all_comments['n_authors'][0]:,}"` unique users, between `{python} f"{all_comments['first_utc'][0]}"` and `{python} f"{all_comments['last_utc'][0]}"`. Table \ref{tbl-example} gives an example entry from this geoparsed Reddit corpus.


In [3]:
# | output: asis

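# Raw strings keep the LaTeX-escaped underscores (\_) without relying on
# Python's invalid-escape-sequence fallback, which emits a warning.
variable = [
    "text",
    "",
    "word",
    "easting",
    "northing",
    "region",
    "lad",
    "author",
    r"word\_count",
    r"author\_count",
]
value = [
    "A Mexicana meal with extra wings ",
    "from Tex in Leytonstone.",
    "leytonstone",
    539268,
    187540,
    "London",
    "Waltham Forest",
    r"t2\_eklyq",
    855,
    431,
]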
desc = [
    "Comment",
    "",
    "Identified Place Name",
    "Place Name Easting",
    "Place Name Northing",
    "Administrative Region",
    "Local Authority District",
    "Anonymised Unique Author ID",
    "Total location mentions",
    "Unique authors mentioning this location",
]

print(
    pd.DataFrame({"Variable": variable, "Value": value, "Description": desc})
    .style
    .format(thousands=",")
    .hide(axis="index")
    .to_latex(
        hrules=True,
        label="tbl-example",
        caption="Example entry from the geoparsed Reddit corpus.",
        position_float="centering",
    )
)

\begin{table}
\centering
\caption{Example entry from the geoparsed Reddit corpus.}
\label{tbl-example}
\begin{tabular}{lll}
\toprule
Variable & Value & Description \\
\midrule
text & A Mexicana meal with extra wings  & Comment \\
 & from Tex in Leytonstone. &  \\
word & leytonstone & Identified Place Name \\
easting & 539,268 & Place Name Easting \\
northing & 187,540 & Place Name Northing \\
region & London & Administrative Region \\
lad & Waltham Forest & Local Authority District \\
author & t2\_eklyq & Anonymised Unique Author ID \\
word\_count & 855 & Total location mentions \\
author\_count & 431 & Unique authors mentioning this location \\
\bottomrule
\end{tabular}
\end{table}



In [4]:
num_comments = places["text"].n_unique()  # number of distinct comment texts
num_words = places["text"].str.split(" ").list.lengths()

There are a total of `{python} f"{unq:,}"` unique locations in this corpus, with a highly skewed distribution in mentions. Many locations were only mentioned a single time (`{python} f"{single / unq:.0%}"`), while 'London' was mentioned in `{python} f"{ldn:,}"` comments. To reduce this skew, we downsampled any location mentioned more than 5,000 times, retaining at most 5,000 randomly sampled comments per location. The goal of this processing was to ensure that our generated embeddings did not simply become biased towards the word embedding for a single location, and instead captured a broader sense of an aggregate region. In our data subset, we find that 1% of users (`{python} f"{len(places_count.filter(pl.col('count') > nnp)):,}"`) mention `{python} f"{places_count.filter(pl.col('count') > nnp)['count'].sum() / places_count['count'].sum():.0%}"` of our place names. This subset leaves a total of `{python} f"{num_comments:,}"` comments containing place names. Comments range from `{python} f"{num_words.min():,}"` to `{python} f"{num_words.max():,}"` words in length, with a mean length of `{python} f"{num_words.mean():,.0f}"`. Table \ref{tbl-sum} gives an overview of the number of comments, word count, and number of places that were identified within each administrative region of the UK.
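The capping step described above can be sketched as follows. This is a minimal illustration using only the standard library, not the processing code used for the corpus; the function name `cap_comments_per_location`, the `(location, comment)` row layout, and the fixed seed are assumptions made for the example.

```python
import random
from collections import defaultdict


def cap_comments_per_location(rows, cap=5000, seed=0):
    """Keep at most `cap` randomly sampled comments for each location."""
    by_location = defaultdict(list)
    for location, comment in rows:
        by_location[location].append(comment)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = {}
    for location, comments in by_location.items():
        if len(comments) > cap:
            # Sample without replacement down to the cap.
            sampled[location] = rng.sample(comments, cap)
        else:
            # Locations at or below the cap are kept in full.
            sampled[location] = list(comments)
    return sampled


# Toy corpus: 'london' dominates, 'alnwick' is mentioned once.
rows = [("london", f"comment {i}") for i in range(20)] + [("alnwick", "comment a")]
capped = cap_comments_per_location(rows, cap=5)
```

With a cap of 5, the 20 toy 'london' comments are reduced to 5, while the single 'alnwick' comment is untouched.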


In [5]:
# | output: asis

from paper.tables import desc_tbl

print(
    desc_tbl(places, lad_embeddings).to_latex(
        hrules=True,
        label="tbl-sum",
        caption="Summary of comments relating to each region in our study.",
        position_float="centering",
    )
)

\begin{table}
\centering
\caption{Summary of comments relating to each region in our study.}
\label{tbl-sum}
\begin{tabular}{lrrrr}
\toprule
Region & Total Comments & Unique Words & Word Count & Total Places \\
\midrule
London & 222,745 & 454,971 & 26,144,378 & 6,338 \\
Scotland & 180,275 & 434,552 & 22,868,507 & 7,796 \\
South East & 146,887 & 384,919 & 16,565,810 & 7,935 \\
North West & 122,010 & 346,764 & 14,591,529 & 7,279 \\
South West & 100,291 & 304,622 & 11,209,793 & 6,117 \\
Yorkshire and The Humber & 92,690 & 286,316 & 10,801,344 & 6,304 \\
East Midlands & 90,785 & 280,912 & 10,179,007 & 6,557 \\
East of England & 79,511 & 260,249 & 8,495,673 & 4,936 \\
West Midlands & 61,346 & 233,914 & 7,285,005 & 4,846 \\
North East & 37,100 & 163,772 & 4,345,753 & 2,446 \\
Wales & 30,436 & 130,288 & 3,833,168 & 2,276 \\
None & 14,366 & 104,003 & 1,425,291 & 1,075 \\
\midrule \bfseries Total & \bfseries 852,461 & \bfseries 1,265,587 & \bfseries 137,745,258 & \bfseries 40,428 \\
\bottomrule
\end{tabular}
\end{table}

## Generating and Analysing Geographic Footprints

Statistical comparisons between two or more distinct texts first rely on an appropriate method for processing the text into a numerical format. Typically, a Term Frequency-Inverse Document Frequency (TF-IDF) approach is used to generate document embeddings [@daniel2007], which assigns word importance based on the frequency of mentions within a corpus. TF-IDF, however, cannot capture broader semantic information, given that it has no knowledge of the meaning behind words. Large Language Models (LLMs) are instead pre-trained on a very large corpus of natural language text, which, alongside their architecture, enables them to more appropriately consider semantic information [@devlin2019]. As with TF-IDF, text is input into these models and output as a numerical representation, which embeds words as high-dimensional vectors, capturing contextual semantic information.
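To make the limitation of TF-IDF concrete, the sketch below implements one common variant (term frequency multiplied by $\log(N / df)$) using only the standard library. The exact weighting scheme and the whitespace tokenisation are assumptions for illustration; other TF-IDF formulations exist.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights.append(
            {
                term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return weights


docs = ["the food was great", "the food was awful"]
vectors = tf_idf(docs)
```

In this toy example 'great' and 'awful' receive identical weights, because the method sees only counts: nothing in the representation records that the two words carry opposite meanings.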

This approach differs from past work that considered only a lexical analysis, in which semantic information and context are not preserved. Instead, we build vectors that act as semantic representations of the locations identified in our corpus, which we name 'semantic footprints'. Given semantic information is preserved, locational embeddings are able to reflect the deeper associations between geographic locations, built from a multitude of contexts and perspectives, forming an aggregate representation. Any geographically cohesive relationships between footprints therefore demonstrate a direct association between geography and language, which has not been captured previously.

Once we generate these footprints we first explore how they produce emerging spatial structures from the bottom-up, generating clusters of small-scale geographic units to capture larger scale aggregations based on semantic information. In this analysis we find that our generated spatial structures broadly conform with larger scale administrative aggregations. We therefore then consider a top-down approach, using these larger administrative regions to generate a comparative analysis of aggregate footprints. To derive explainable characteristics of observed differences between these regions, we observe how national identities can be captured through text, and how these identities vary geographically.

## Creating Embeddings

We first create semantic embeddings for each comment in which a location was mentioned, using the `sentence-transformers` Python library [@reimers2019], with the `all-mpnet-base-v2` model^[https://huggingface.co/sentence-transformers/all-mpnet-base-v2]. With our selected embedding model, we then perform the following steps to generate embeddings for each Local Authority District (LAD) in Great Britain.

1. Mask any place name with a generic token: 'PLACE' (using place name text spans included in the corpus).
2. Generate sentence embeddings for each comment.
3. Group embeddings by LAD using identified locations, taking the mean embedding.
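
The masking in step 1 can be sketched as below, assuming that each place name span is available as a `(start, end)` pair of character offsets. The span format and the helper name `mask_place_names` are hypothetical; the corpus may store spans differently.

```python
def mask_place_names(text, spans, token="PLACE"):
    """Replace each (start, end) character span in `text` with `token`."""
    # Work from the last span to the first so earlier offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + token + text[end:]
    return text


comment = "I live in London but work in Croydon."
# Hypothetical spans, derived here by searching for the known place names.
spans = [
    (comment.index(name), comment.index(name) + len(name))
    for name in ("London", "Croydon")
]
masked = mask_place_names(comment, spans)
```

Replacing spans from right to left matters because 'PLACE' is rarely the same length as the name it replaces, so a left-to-right pass would shift the offsets of later spans.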

To visualise the outputs from this processing we consider an example comment $s_i = \text{'I live in London'}$, shown in Equation \ref{eq-dims}.

$$
\begin{aligned}
\mathit{s_{i}} &= \text{'I live in \textit{London}'} \\
\textbf{1. }\downarrow \\
\mathit{s_{i}} &= \text{'I live in \texttt{PLACE}'},
\end{aligned}
\qquad
\begin{aligned}
\textbf{2. }\mathit{s_{i}} \rightarrow 
\begin{bmatrix}
x_{1} \\
x_{2} \\
\vdots\\
x_{n}
\end{bmatrix},
\end{aligned}
\qquad
\begin{aligned}
\textbf{3. }\mathit{LAD_{j}} = 
\begin{bmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,t} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,t} \\
\vdots  & \vdots  & \ddots & \vdots  \\
x_{n,1} & x_{n,2} & \cdots & x_{n,t}
 \end{bmatrix} \rightarrow \begin{bmatrix}
\bar{x_{1}} \\
\bar{x_{2}} \\
\vdots \\
\bar{x_{n}}
\end{bmatrix}
\end{aligned}
$${#eq-dims}

In Equation \ref{eq-dims}, $n$ is the `sentence-transformers` embedding dimension (768), and $t$ is the total number of unique comments that relate to locations within a single LAD region ($LAD_j$). The values ($x_1, \dots, x_n$) in step **2.** are the components of the embedding vector for the comment $s_i$, which capture its semantic information; they are outputs of the model rather than its weights. This process is also shown in @fig-workflowfoot.

![Workflow diagram showing Reddit Corpus processed into sentence embeddings, then aggregated into location and LAD semantic footprints.](./figures/workflow.pdf){#fig-workflowfoot}

Given that each LAD has a variable number of associated comments, we must process the associated embeddings into a 'semantic footprint' representation of a fixed size, so that they may be directly compared. To achieve this, all embeddings associated with comments relating to locations within $LAD_j$ are reduced to a single vector of size $1 \times 768$. The most common approach for this reduction is 'mean-pooling', taking the element-wise mean across all embeddings, which is widely used in tasks like topic analysis [@reimers2019].
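A minimal sketch of mean-pooling is given below, using toy 4-dimensional vectors in place of the 768-dimensional embeddings. Each inner list is one comment embedding, so the data is stored as $t$ rows of length $n$ (the transpose of the matrix in Equation \ref{eq-dims}); the name `mean_pool` is illustrative only.

```python
def mean_pool(embeddings):
    """Average a list of equal-length vectors into one vector."""
    n_comments = len(embeddings)
    n_dims = len(embeddings[0])
    return [
        sum(vector[d] for vector in embeddings) / n_comments for d in range(n_dims)
    ]


# Toy example: three comment embeddings with n = 4 dimensions (768 in the paper).
lad_comments = [
    [0.0, 1.0, 2.0, 3.0],
    [2.0, 1.0, 0.0, 3.0],
    [4.0, 1.0, 4.0, 3.0],
]
footprint = mean_pool(lad_comments)
```

Regardless of how many comments a LAD contains, the pooled footprint always has the same length as a single embedding, which is what makes LADs directly comparable.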

By masking place names, we ensure that no comment embeddings accidentally incorporate geographically grounded information. For example, comments in South Eastern local authorities are likely to frequently mention London, given they are geographically proximal. Embeddings for these locations would therefore capture an association through the mention of London, rather than general semantic information. For our work, we want to exclude any geographic information, ensuring that embeddings solely capture semantic associations.

Given that transformers are a relatively new architecture in natural language processing, and the creation of these models requires significant computational resources and training time, their use to date has been limited in related research. Our choice to use the transformer architecture stems from the emphasis we place on the extraction of nuanced and contextual semantic information, which is lost with lexical count-based methods like TF-IDF. It should be noted however that while TF-IDF methods are less complex, they are typically more interpretable; for instance, words that contribute importance to an embedding may be extracted from a TF-IDF model. The numerical representations of any text generated by transformers are not directly interpretable in this manner. The following section therefore analyses our semantic footprints with respect to their numerical representations, rather than through their lexicons.

## Spatial Clustering and Autocorrelation

It is reasonable to assume that there are LADs within our corpus whose embeddings capture similar semantic properties. A typical method to group unlabelled multi-variate data based on shared properties uses unsupervised clustering [@sinaga2020;@likas2003]. Therefore, to explore whether geographically cohesive clusters appear within our semantic embeddings, we generate hierarchical clusters, which are not geographically bounded, using agglomerative clustering. This clustering method allows the optimal number of clusters to be determined automatically; for our footprints this was 3. These clusters were visualised geographically, to examine whether geographically cohesive groupings occurred. The proportion of clusters present within each administrative region (RGN)^[The highest tier of sub-national division in England. For Scotland and Wales we use the full national extents.] in Great Britain was also plotted to determine whether clusters appeared to correlate with administrative boundaries.
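For readers unfamiliar with agglomerative clustering, the sketch below shows the basic bottom-up procedure on toy two-dimensional points: every observation starts as its own cluster, and the two closest clusters are merged repeatedly until the requested number remains. Average linkage and Euclidean distance are assumptions made for this illustration, and the number of clusters is passed in directly rather than selected automatically.

```python
import math


def agglomerative(points, n_clusters):
    """Average-linkage agglomerative clustering; returns a label per point."""
    clusters = [[i] for i in range(len(points))]  # start with singletons

    def linkage(a, b):
        # Mean pairwise Euclidean distance between members of two clusters.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find and merge the closest pair of clusters.
        pairs = [
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        _, i, j = min(pairs)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for index in members:
            labels[index] = label
    return labels


# Toy 2-D 'footprints' forming three well-separated groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (21, 0)]
labels = agglomerative(points, n_clusters=3)
```

Note that no coordinates or adjacency information enter the procedure, which mirrors the point above that the clusters are not geographically bounded: any geographic cohesion must emerge from the embeddings themselves.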

To quantify the level of spatial autocorrelation that our embeddings exhibit, we consider the Moran's I metric, which identifies the spatial relationship between each observation and its geographic neighbours [@anselin1995;@rey2023]. Moran's I values are generated based on the strength of correlation between values and the aggregate values of their geographic neighbours, known as their spatial lag. Higher Moran's I values therefore denote a stronger spatial autocorrelation. Given that Moran's I analysis requires univariate data, we explore global spatial autocorrelation of our semantic footprints UMAP decomposed into two dimensions, and plot both dimensions against their spatial lag, giving two distinct global Moran's I values.

We then consider how localised levels of high spatial autocorrelation may be identified through a Local Indicators of Spatial Autocorrelation (LISA) analysis. Instead of single global values, LISA analysis determines whether each unique LAD polygon exhibits a significant level of spatial autocorrelation, and assigns a local Moran's I value for each.
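
The global and local statistics described above can be sketched for a single variable as follows. This is an illustrative standard-library implementation under stated assumptions: weights are row-standardised (each unit's spatial lag is the mean of its neighbours), and the local statistic is scaled by the second moment $m_2 = \sum_i z_i^2 / n$, so that the global value equals the mean of the local values. Significance testing, which a full LISA analysis requires, is omitted.

```python
def spatial_lag(values, neighbours):
    """Row-standardised spatial lag: mean value of each unit's neighbours."""
    return [
        sum(values[j] for j in neighbours[i]) / len(neighbours[i])
        for i in range(len(values))
    ]


def morans_i(values, neighbours):
    """Return (global I, list of local I_i) for row-standardised weights."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]  # deviations from the mean
    lag = spatial_lag(z, neighbours)
    m2 = sum(d * d for d in z) / n  # second moment of the deviations
    local = [z[i] * lag[i] / m2 for i in range(n)]
    global_i = sum(local) / n  # equals sum(z * lag) / sum(z ** 2)
    return global_i, local


# Toy example: six units along a line, each neighbouring the adjacent units.
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
clustered = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]  # similar values sit together
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # neighbours always differ

i_clustered, local_clustered = morans_i(clustered, neighbours)
i_alternating, _ = morans_i(alternating, neighbours)
```

The clustered arrangement yields a positive global value, while the alternating arrangement yields a negative one. In the clustered case the local values are highest away from the boundary between the two groups and fall to zero at units 2 and 3, where the neighbours disagree.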
 
It is important to note that the magnitude of our embedding values does not convey any definable information; values therefore only highlight differences in semantic information between regions, rather than importance. For example, an embedding value of 0 is not less important than a value of 1 or -1.

## Semantic Similarity

Following our analysis of LAD semantic footprints, we explore our semantic footprints from a top-down perspective, aggregating LADs into established large-scale RGNs across Great Britain, taking the mean of the collective semantic footprints. Each RGN is therefore represented by a single 768-dimensional semantic footprint embedding. We then calculate the cosine similarity between each pair of RGN embeddings, demonstrating the level of inter-region semantic cohesion across Great Britain.

Cosine similarity is a common metric for comparing embeddings, as it is invariant to the magnitude of the vectors, and only considers the direction. This is important as the magnitude of embeddings is not meaningful, and only the direction of the vector conveys information. For example, the embedding for the 'South East' cannot be twice as important as the embedding for the 'North West'.
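
The magnitude invariance described above can be checked with a short sketch. The three-dimensional vectors are toy stand-ins for the 768-dimensional RGN footprints, and the variable names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional footprints standing in for 768-dimensional RGN embeddings.
region_a = [1.0, 2.0, 3.0]
region_b = [2.0, 4.0, 6.0]   # same direction as region_a, twice the magnitude
region_c = [3.0, 0.0, -1.0]  # orthogonal to region_a

same_direction = cosine_similarity(region_a, region_b)
orthogonal = cosine_similarity(region_a, region_c)
```

Doubling every value in a footprint leaves its similarity to the original at 1 (up to floating-point error), whereas a footprint pointing in an unrelated direction scores 0.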

## Capturing National Identities through Text

To generate explainable characteristics of any geographically distinct semantic footprints generated in our analysis, we consider how a language model associates national identities with the semantic properties of text. In our approach we mirror qualitative data collection methodologies in political science research, where individuals are typically queried as to their chosen national identity [@haesly2005;@griffiths2022], instead generating the categorisations of comments by querying a large language model (LLM).

LLMs are pre-trained on a large corpus of natural language text, building representations of this text that emulate a human understanding of language. The underlying theory is that these representations capture the collective knowledge of humans that contributed the natural language text used to build them. Therefore, in addition to factual information, when posed with non-deterministic questioning, these models are able to contribute the biased information that is incorporated into their model weights.

Recent research has noted the ability to perform zero-shot classification using LLMs, where class predictions may be made without the model ever having previously seen the labels [@wei2022a;@wei2022]. While research has considered the use of questionnaires to query the strength of national identities within the UK [@haesly2005;@griffiths2022], an LLM may instead be used. For example, an LLM may be questioned whether it personally feels a sequence of text appears to be 'British', 'English', 'Scottish', or 'Welsh'. Through this zero-shot classification, we are able to determine the strength of national identity associated with each region in our work, to examine whether this appears to correlate with any cohesion between the semantic footprints that we generate. Importantly, we are also able to generate confidence values from the chosen LLM, allowing for the strength of these national identities to be captured.

Semantic information within our comments is expected to capture both explicit information contributed by users, for example stating 'London is a British city', and implicit semantic information that exists within language. For example, the phrase 'bonnie Scotland' may suggest a strong identity due to the inclusion of Scottish slang^[See 'Scottish English' or 'Scots'; [@stuart-smith2008]]. Unlike our semantic footprints, we do not mask place name mentions in these embeddings, enabling the model to make its own decisions regarding place name mentions.

To identify regional identities through semantic information, we build on the emergent properties of large language models, which enable a task known as 'Zero-Shot Classification'. This allows models to predict a class that was not seen during training, by generating a prompt that contains the labels required. For this task we select the `typeform/distilbert-base-uncased-mnli` model^[https://huggingface.co/typeform/distilbert-base-uncased-mnli], which is tailored towards zero-shot classification and therefore generates slightly different embeddings compared with those used for our semantic footprints. The following gives an example prompt for our task, with a portion of a comment taken from our corpus in which the Scottish colloquial slang 'gonnae' is used:

```
Classify the following input text into one of the following four categories:
[British, English, Scottish, Welsh]

Input Text: My favourite was in Livingston: 'Rab, I'm gonnae find you.'
```

The output would then be given as a sequence of confidence values for each label:

```
'labels': ['Scottish', 'British', 'Welsh', 'English']
'scores': [0.761, 0.144, 0.052, 0.043]
```
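
Models fine-tuned on MNLI are commonly applied to zero-shot classification by treating the input text as a premise and each candidate label as a hypothesis, then normalising the entailment scores across labels. The sketch below illustrates only that final normalisation step. The hypothesis template and the logit values are illustrative assumptions, not outputs of the model used in this work.

```python
import math


def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis (template is an assumption)."""
    return [template.format(label) for label in labels]


def scores_from_entailment(entailment_logits):
    """Softmax the per-label entailment logits so the scores sum to one."""
    peak = max(entailment_logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(v - peak) for label, v in entailment_logits.items()}
    total = sum(exps.values())
    ranked = sorted(exps, key=exps.get, reverse=True)  # highest confidence first
    return {"labels": ranked, "scores": [exps[label] / total for label in ranked]}


labels = ["British", "English", "Scottish", "Welsh"]
hypotheses = build_hypotheses(labels)

# Hypothetical entailment logits for a single comment (not real model output).
toy_logits = {"British": 0.5, "English": -0.7, "Scottish": 2.2, "Welsh": -0.5}
result = scores_from_entailment(toy_logits)
print(result["labels"][0])
```

In practice this step would typically be handled by a library rather than by hand, for example the `zero-shot-classification` pipeline in Hugging Face `transformers`, which returns `labels` and `scores` in the format shown above.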

# Results {#sec-results}


In [6]:
#| label: fig-clusters
#| fig-cap: Semantic footprints associated with each LAD corpus coloured by hierarchical agglomerative clusters where $K=3$. (a) Footprints UMAP decomposed into two dimensions. (b) Proportion of clusters by RGN. (c) Geographic location of clusters.

from paper.figures import plt_place_vectors, process_embeddings

plt_place_vectors(lad_embeddings, regions)
plt.show()

<Figure size 2400x1800 with 3 Axes>

@fig-clusters (a) shows clusters of LAD transformer embeddings UMAP decomposed into two dimensions, indicating embeddings that share similar semantic properties. These clusters appear to broadly correlate with three distinct regions within Great Britain, where cluster 0 most closely identifies with England, 1 with London and surrounding areas, and 2 with Scotland and Wales (@fig-clusters (b-c)). The few areas that appear as cluster 0 in Wales and Scotland are major urban centres like Cardiff, Glasgow, and Edinburgh. Overall, these clusters appear to be geographically restricted, and even broadly conform with administrative boundaries like the Welsh and Scottish borders.
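
As a rough illustration of how hierarchical agglomerative clustering yields $K=3$ groups, the following minimal sketch repeatedly merges the two closest clusters until three remain. Average linkage, Euclidean distance, and the toy two-dimensional points are assumptions made for illustration; the linkage and inputs used for @fig-clusters may differ.

```python
import math


def agglomerative(points, k):
    """Merge the closest pair of clusters (average linkage) until k clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per point

    def average_distance(a, b):
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # Find the pair of clusters with the smallest average pairwise distance.
        _, a, b = min(
            (average_distance(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical 2D coordinates standing in for decomposed LAD embeddings.
toy_points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (21, 0)]
toy_labels = agglomerative(toy_points, k=3)
print(toy_labels)
```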

These findings appear to share similarities with past work that has observed strong 'boundary effects', where lexical similarity between geotagged Tweets often correlates with administrative boundaries [@li2021;@bailey2018;@arthur2019;@yin2017a]. Our embeddings also exhibit the general geographically coherent patterns that have been observed in geographical lexical variations in social media [@russ2012;@doyle2014;@huang2016;@goncalves2014;@perez2019;@arthur2019;@eisenstein2014]. Notably, unlike dialects, where a geographic component is expected, the geographic association of our general semantic embeddings has not been demonstrated in past work. Results therefore demonstrate that despite no pre-existing geographic information like geotags or place names, general text associated with locations appears to embed a geographic component. The geographic coherence in our results is particularly strong at the borders of Scotland and Wales, which conforms with our hypothesis that the vernacular geography that exists within social media text embeds components that contribute to the strength of national identities [@haesly2005].

As noted, however, the major cities in Wales and Scotland (Cardiff, Glasgow, and Edinburgh) share a cluster with English LADs rather than with their respective countries, suggesting that these locations are more semantically connected with the rest of Great Britain. This observation mirrors the results of work that considered co-occurring locational mentions between cities, where shared city mentions in text often appear irrespective of distance, and across administrative borders [anonymised]. This deviation from the relative semantic isolation of Scotland and Wales from England appears to reflect the nature of major cities, given they tend to share stronger physical geographic connections across a larger geographic scope, and more influential cultural connections compared with rural areas, captured in our work through shared semantic traits.

Cluster 1 presents in areas surrounding London and suggests distinctiveness of this region relative to the rest of Great Britain. This is interesting given London's extensive connectivity relative to the rest of the country, and the general sense of strong association with other cities, given it is the capital city [anonymised]. Our results therefore suggest that despite London's importance nationally, semantic information is able to capture a deeper context that dissociates it from other regions. This effect may be due to factors unique to London, for example its prominence globally, influencing both tourism and business external to the United Kingdom, which alter the cultural landscape of the city. The isolated characteristics of London are particularly observable through its economic differences, where high costs of living have generated the need for a 'London weighting'^[https://en.wikipedia.org/wiki/London_weighting] of salaries [@hirsch2016].

The following section formalises the level of geographic coherence that the embeddings exhibit, and highlights the key locations that drive the relationship between text and geography.

## Moran's I Analysis


In [7]:
#| label: fig-morans
#| fig-cap: 'Moran''s I Plot: LAD embeddings decomposed into 2 dimensions and standardised against their spatial lag.'

from paper.figures import plt_morans, process_moran

lad_embeddings, w, explained = process_moran(lad_embeddings)
moran1, moran2, sim = plt_morans(lad_embeddings, w, explained)
plt.show()

<Figure size 1650x1050 with 1 Axes>

To quantify whether our embeddings demonstrate spatial autocorrelation, we consider the Moran's I metric, which identifies the spatial relationship between each observation and its geographic neighbours [@anselin1995]. Given that this analysis requires univariate data, we explore the global spatial autocorrelation of our UMAP decomposed embeddings, computing the spatial lag for each of the two dimensions. In @fig-morans, we plot both values for each LAD semantic footprint in Great Britain against the spatial lag of these values. A higher correlation between the semantic footprint values and their spatial lag indicates a stronger level of global spatial autocorrelation, resulting in a higher Moran's I value. @fig-morans shows a positive correlation between the decomposed embedding values and their spatial lag, resulting in Moran's I values of `{python} f"{moran1.I:.2f}"` and `{python} f"{moran2.I:.2f}"`. This indicates a reasonably strong spatial autocorrelation in both embedding dimensions, confirming that semantic footprints are typically more similar between nearby locations. While the Moran's I values for both dimensions are similar, their cosine similarity is negative (`{python} f"{sim.item():.2f}"`), meaning these two decomposed dimensions capture distinctly different semantic traits.
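
For reference, global Moran's I for values $x_i$ with spatial weights $w_{ij}$ can be written as $I = \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij} z_i z_j}{\sum_{i} z_i^2}$, where $z_i = x_i - \bar{x}$. The minimal sketch below computes this directly. The four-location chain of neighbours and the values are hypothetical, and the weights construction used for @fig-morans is not assumed here.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]  # deviations from the mean
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    cross = sum(weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in z)


# Hypothetical chain of four locations, each neighbouring the next.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

smooth = morans_i([1, 2, 3, 4], chain)  # neighbours hold similar values
alternating = morans_i([1, -1, 1, -1], chain)  # neighbours hold opposite values
print(round(smooth, 3), round(alternating, 3))
```

Positive values indicate that neighbouring locations tend to hold similar values, while negative values indicate that neighbours tend to differ.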

While spatially coherent results have been demonstrated from the perspective of dialects on social media [@russ2012;@doyle2014;@huang2016;@goncalves2014;@perez2019;@arthur2019;@eisenstein2014], we have demonstrated that this phenomenon can also be captured from general semantic information. Notably, while dialects have always been considered to have strong geographical grounding [@trudgill2004], it is more surprising that general semantic information regarding locations similarly exhibits this relationship.


In [8]:
#| label: fig-lisa
#| fig-cap: 'Local Indicators of Spatial Autocorrelation (LISA). (a/d) One-dimensional embedding values. (b/e) Local Moran''s I values ($Is$). (c/f) LISA HH and LL significant values ($p<0.05$); both are included as the values of the embeddings do not convey information.'

from paper.figures import plt_lisa

plt_lisa(lad_embeddings, [0, 1])
plt.show()

<Figure size 3600x2400 with 10 Axes>

To explore local indicators of spatial autocorrelation (LISA), we plot each decomposed embedding in @fig-lisa (a/d), each local Moran's I value in (b/e), and all significant ($p<0.05$) HH and LL LISA quadrants in (c/f). Note that selecting only significant $p$ values in @fig-lisa (c/f) excludes regions whose apparent autocorrelation could plausibly arise if values were randomly distributed geographically. From @fig-lisa (c/f), we can see that large areas with significant levels of spatial autocorrelation include:

* Scotland
* Wales
* London and surrounding LADs
* the South West, towards Cornwall

As demonstrated by the low cosine similarity between our UMAP embeddings, they appear to capture distinctly different semantic information. London, for example, only appears in dimension 0, while dimension 1 captures broader spatial autocorrelation across Scotland and Wales. In Scotland, both LISA plots show Glasgow and Edinburgh as HL/LH areas, where semantic information in these cities differs from that of surrounding LADs, an effect that is also captured in some LADs surrounding London. England overall appears to be a less semantically cohesive country based on this analysis, where most LADs do not contribute significant levels of spatial autocorrelation.
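
The quadrant labels used above can be illustrated with a minimal sketch of the local Moran statistic, $I_i = z_i \sum_j w_{ij} z_j$, using standardised values and row-standardised weights. The chain of neighbours and the values are hypothetical, and the permutation test used to obtain $p$ values is omitted.

```python
def local_moran(values, weights):
    """Return (local I, quadrant) per location; quadrant is own value then neighbours' lag."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    z = [(v - mean) / sd for v in values]  # standardised values

    results = []
    for i in range(n):
        row_sum = sum(weights[i])
        # Spatial lag: weighted average of neighbouring standardised values.
        lag = sum(weights[i][j] * z[j] for j in range(n)) / row_sum if row_sum else 0.0
        quadrant = ("H" if z[i] > 0 else "L") + ("H" if lag > 0 else "L")
        results.append((z[i] * lag, quadrant))
    return results


# Hypothetical chain of five locations with one high-valued outlier in the middle.
chain = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
spike = local_moran([0, 0, 10, 0, 0], chain)
quadrants = [q for _, q in spike]
print(quadrants)
```

Here the central location is labelled HL because its own value is high while its neighbours are low, which gives a negative local statistic.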

These results again demonstrate geographic cohesion between semantic footprints, which notably appear to correspond with the national boundaries of Wales and Scotland. This mirrors the observations of past work where dialect differences appeared to correlate with administrative boundaries [@li2021;@bailey2018;@arthur2019;@yin2017a]. In addition to Wales and Scotland, we have also identified a notable grouping in the South West, which potentially reflects the Cornish identity [@deacon2007], as well as a grouping associated with London.


## Semantic Similarity and Identity


In [9]:
#| label: fig-similarity
#| fig-cap: Scaled cosine similarity of embeddings for administrative regions across the UK. Higher values indicate greater cosine similarity. Regions shown in descending order by mean cosine similarity value.

from paper.figures import plt_similarity

plt_similarity(region_embeddings)
plt.show()

<Figure size 2400x3000 with 13 Axes>

Given the regions highlighted as having strong spatial autocorrelation in their semantic footprints appear to broadly conform with the administrative regions of Wales, Scotland, and London, we examine these footprints from a top-down analysis using pre-defined larger scale aggregations.

@fig-similarity compares the cosine similarity between each RGN embedding, allowing for inter-regional cohesion to be explored. The North West has the overall highest level of cosine similarity, displaying comparatively high similarity with most regions across England, excluding London. London has the lowest overall similarity, only sharing positive cosine similarity values with the South and South East of England. As expected, Scotland and Wales have low overall cosine similarity values, with Wales sharing even lower similarity with respect to London and the South East compared with Scotland. Mean values show clearly that the least cohesive regions appear to be London, Wales, and Scotland, three regions that are also those with the strongest levels of spatial autocorrelation. 
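
A minimal sketch of this comparison is given below, computing the cosine similarity between every pair of region embeddings and the mean similarity of each region with all others. The region names and two-dimensional vectors are hypothetical, and the scaling applied in @fig-similarity is not reproduced.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def mean_similarity(embeddings):
    """Mean cosine similarity of each region with every other region."""
    means = {}
    for name, vector in embeddings.items():
        others = [cosine(vector, v) for other, v in embeddings.items() if other != name]
        means[name] = sum(others) / len(others)
    return means


# Hypothetical region embeddings (real embeddings have many more dimensions).
toy_regions = {
    "Region A": [1.0, 0.0],
    "Region B": [2.0, 0.0],
    "Region C": [0.0, 3.0],
}
toy_means = mean_similarity(toy_regions)
least_cohesive = min(toy_means, key=toy_means.get)
print(least_cohesive)
```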

Excluding London, the North East is the region in England with the lowest overall cosine similarity with the rest of Great Britain. This is perhaps reflective of distinct differences in this region, for example the distinctly lower gross value added (GVA) compared with other regions [@fenton2018], or the general sense of strong identity that is often noted by residents [@middleton2008]. In contrast, the North West is home to nationally influential urban conurbations, especially between Manchester and Liverpool [@oguz2022], likely generating the highest overall semantic similarity of this region compared with the rest of the UK. Comparatively, the East of England, South East and London are neighbouring regions that share high similarities with each other, but exhibit low similarity with the rest of Great Britain, suggesting there are semantic components that distinguish this part of the country from the rest. There is a slightly higher mean similarity with respect to Scotland compared with Wales, due to higher similarities with regions in England, like the North West and South East. Major urban centres in Scotland are relatively well connected to Great Britain through rail routes, and Edinburgh and Glasgow are historically important UK cities, captured by their distinct difference in embedding values during the spatial autocorrelation analysis. This factor likely increases the cosine similarity of Scotland with regions in England, while Wales in this sense is less directly associated with the rest of the UK.

To determine whether regional identities generated by a large language model align with these semantically isolated regions in our analysis, we plot the distribution of regional identities identified through our zero-shot classification in @fig-identity.

Across each region, the 'English' identity is always lower than 'British', suggesting that regions within England are typically more strongly associated with the United Kingdom^[Note that despite etymologically relating to 'Great Britain', the term 'British' refers to 'belonging to or relating to the United Kingdom of Great Britain and Northern Ireland'.] than solely England. Unlike English regions, however, comments relating to both Scottish and Welsh locations are more strongly associated with their respective nationalities. Comments relating to Welsh locations nevertheless appear on average to have stronger confidence values with respect to the British classification, compared with Scottish locations. Similar observations have been captured from qualitative interviewing, where Welsh residents similarly appear to more strongly associate themselves with the British identity, compared with Scottish residents [@carman2014;@llamas2014;@llamas2009;@haesly2005]. Of the English regions, London has a distinctly higher average confidence value for both British and English identities compared with all other regions. Notably, given that the semantic footprints for Scotland, Wales, and London also have the lowest overall cosine similarity values, these differences in generated identity compared with other regions are a likely component in their semantic differences.
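
The per-region values compared here are mean confidence values with standard errors. A minimal sketch of that aggregation is shown below, assuming the standard error is the sample standard deviation divided by $\sqrt{n}$; the confidence values are hypothetical and are not taken from our corpus.

```python
import math
import statistics


def mean_and_standard_error(scores):
    """Mean and standard error (sample standard deviation / sqrt(n)) of a list of scores."""
    mean = statistics.mean(scores)
    standard_error = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, standard_error


# Hypothetical 'Scottish' confidence values for three comments in one region.
toy_scores = [0.2, 0.4, 0.6]
toy_mean, toy_se = mean_and_standard_error(toy_scores)
print(round(toy_mean, 3), round(toy_se, 3))
```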


In [10]:
#| label: fig-identity
#| fig-cap: 'Zero Shot classification of each corpus into regional identities; [B]ritish, [E]nglish, [S]cottish, [W]elsh. Values show mean confidence value across each comment, lines indicate standard error. Descending order by [B]ritish confidence. The dashed line separates English regions from Scotland and Wales.'

from paper.figures import plt_zero_shot


plt_zero_shot(Paths.PROCESSED / "places_zero_shot.parquet")
plt.show()

<Figure size 1650x1050 with 1 Axes>

## General Observations

Unlike typical representations of the North-South divide within England [@jewell1994], semantic differences appear to be influenced primarily by proximity to London. The South West of England therefore appears to be distinct from the South East, with a stronger association with the North. South Eastern regions do, however, share lower similarity with the Midlands and North of England, which conforms with a typical view of the English North-South divide.

In a similar sense, Scotland and Wales demonstrate distinctly more cohesive semantic properties compared with England, where groupings of high spatial autocorrelation are constrained to smaller regions, like London. In traditional linguistic research, the spoken dialect across England is known to vary considerably [@chambers1998;@knowles1973;@deacon2007;@mackenzie2022], which itself captures the distinct social differences, and the localised identities that exist across geographic space. Instead, the high cohesion within Wales and Scotland appears to capture the sense of national identity that these constituent countries exhibit in our analysis, which is a common qualitative observation in political science research [@haesly2005;@carman2014].

As demonstrated in past work that has examined both physical and non-physical networks, our observed semantic information similarly appears to correlate with pre-defined administrative boundaries, particularly the national boundaries of Scotland and Wales [@li2021;@bailey2018;@arthur2019;@yin2017a]. The distinct difference in footprints between each constituent country in the UK conforms with the idea that vernacular geography captures a sense of identity, given our zero-shot classification demonstrates distinct nationalities between Scotland and Wales, unlike English regions where the generated national identity is typically considered British rather than English. Notably however, the slightly stronger British identity within Wales has been observed previously through qualitative interviewing [@haesly2005;@carman2014], suggesting that even the nuanced properties of text appear to correlate with the true perceptions of individuals. It is also worth noting that, given the exclusion of place names in our embeddings, these distinct differences are not simply the result of differences in place names (e.g. place names in Wales are distinct from England), which may have influenced the results of past lexical work.

Despite most locations across Scotland and Wales appearing disconnected from the rest of the UK, major cities like Glasgow and Edinburgh are more semantically similar to it, a distinction that was also observed when the distance decay of locational co-occurrences in text was examined [anonymised]. This suggests that these cities are typically more semantically connected with the UK, regardless of geographic distance and borders, while other locations typically share semantic properties within the same nation, captured through stronger spatial autocorrelation.

Internal migration patterns within the UK are primarily influenced by family ties, rather than economic factors, employment, or education [@thomas2019]. The observations made in our work demonstrate that this sense of belonging to regions influences the geographically cohesive nature of our semantic footprints. While populations have the ability to distribute evenly across geographic space, they are often reluctant to move far. Local inhabitants within regions develop an identity associated with their home region, traditionally captured in language through dialect variation, and demonstrated in our work through broader semantic associations, which embed contextual meaning, incorporating the cultural variation of regions.

# Conclusion {#sec-conclusion}

Our paper demonstrates a new method to compare aggregate semantic information, which we name semantic footprints, for local authorities and regions within the UK, derived from Reddit comments that mention geoparsed locations. When examining the semantic footprints of each LAD in the UK, we find that geographically cohesive clusters appear, with significant levels of spatial autocorrelation. Clusters broadly conform with the national borders of Scotland and Wales, while London also appears to be semantically distinct from the rest of England. Through an examination of generated national identities associated with each region, we find that these distinct geographic groupings are a likely result of associated identities, which are generated through general associations captured through the vernacular geography of all users in our social media corpus.

Geoparsing methods contribute an additional geographic dimension to non-geotagged social media data, allowing for a much larger repository of informal natural language geographic text to be used for research. Future work may consider the use of Reddit comment data to derive notable urban areas of interest [@chen2019]. This area of research in particular would benefit from methodologies focussing on the extraction of fine-grained locations from text, which at present is a challenging task [@han2018].

# References {-}