<h1>Call-in Corpora: Collocations and Concordances</h1>



<h2>Introduction</h2>

Corpus data taken from talkback radio transcriptions offer both opportunities and challenges for analyzing natural language. As a form of language use, Talkback Radio (TR) occupies an intermediate position between purely spontaneous conversation on the one hand, and formal public address on the other: "while talkback radio is addressed at a public audience it is frequently unplanned, improvised or 'ad hoc' and participants must communicate in real-time" (Bednarek 2014: 4-5).

<h3>Opportunities</h3>

The semi-spontaneous character of this form of language therefore offers an interesting opportunity to study vernacular patterns of language use. TR conversations are predominantly spoken in the local vernacular, offering an opportunity to analyze locally distinct patterns of usage. At the same time, TR must be comprehensible to a broad population of listeners at a regional, and in some cases national, level. It therefore doesn't present the same challenges as truly spontaneous natural language use in everyday speech, which can be prohibitively idiolectic or hyper-localized.
<br><br>
TR corpora also represent a range of dialogic speech situations, recording:

1. Dialogue between hosts and callers.
2. Dialogue between hosts and guests.
3. Dialogue between co-hosts.
4. <i>Pseudo-dialogue</i> between the singular host and the passive listening audience.

This dialogic diversity affords opportunities for analyzing varying patterns of usage depending on conversational contexts.

<h3>Analysis: Data and Scope</h3>

This preliminary textual analysis considers 29 TR transcripts of Australian radio talk: 14 recordings from the public broadcaster, comprising ABC National Radio (NAT) and ABC broadcasts to eastern Australia (ABCE) and to southern and western Australia (ABCNE); and 15 recordings from commercial broadcasts to eastern Australia (COME) and to southern and western Australia (COMNE).
<br>
Our analysis focusses primarily on frequently occurring terms and collocations, drawing on the work of previous scholars like Bednarek (2014), who analyzed 'markers of linguistic involvement' in Australian Talkback Radio (ART). In general, we take a 'corpus-wide' approach, examining patterns of usage that are common across regions, and across the public-commercial divide. We employ a quantitative empirical analysis to identify features of interest, alongside a more qualitative analysis of phrases' semantic meanings and pragmatic functions. As this brief introduction hopefully makes clear, our analysis is far from exhaustive: much more remains to be learned and analyzed from these data.

<h3>Challenges</h3>

TR text transcripts are frequently raw and messy, for instance preserving disfluencies like 'uh', 'um' and 'oh', which present preprocessing challenges. Transcripts are generally highly unstructured, with little built-in annotation of features. The ART corpus we use here is characteristic in these respects.
<br><br>
Our analysis therefore gives some attention to the methodological questions involved in preprocessing TR transcript data, such as stopword curation and tokenization strategies.


# 1. Data Assemblage

## 1.1 Libraries

In [302]:
# Libraries

!pip install cs50
from cs50 import get_string

import warnings
from google.colab import drive
import os

import numpy as np
import pandas as pd

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

import spacy



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [303]:
warnings.filterwarnings(action='ignore')

## 1.2 Importing Data

Running the cell below will mount your Google Drive to the notebook. Running this notebook requires that the files you wish to process for text analysis are stored in a folder in your Google Drive.

In [304]:
# Mount Google Drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Running the cell below will prompt you to input a path to the folder containing the files you wish to submit for text analysis. The files must be formatted as .txt. Input the path as simple text without quotation marks.

In [305]:
folder_path = get_string(
    """
    Please input a path to the folder to submit for text analysis.
    This notebook was designed to process the Australian Radio Talkback (ART) dataset.
    """
)


    Please input a path to the folder to submit for text analysis.
    This notebook was designed to process the Australian Radio Talkback (ART) dataset.
    /content/drive/My Drive/Data Science/Spring 2024/ANLP/AT1 dataset_AusRadioTalkback


In [306]:
# ##  path for programmer

# folder_path = '/content/drive/My Drive/Data Science/Spring 2024/ANLP/AT1 dataset_AusRadioTalkback'

In [307]:
# Build a dictionary containing all the contents of the folder containing the AT1 dataset
# The code in this cell was written with the assistance of OpenAI's ChatGPT 4.0
dataset = {}
for file_name in os.listdir(folder_path):
  file_path = os.path.join(folder_path, file_name)
  if os.path.isfile(file_path) and file_name.endswith('.txt'):
    with open(file_path, 'r') as file:
      content = file.read()
      dataset[file_name] = content

for file_name, content in dataset.items():
  print(f"Content of {file_name}:")
  print("")
  print(content)
  print("-----")

Streaming output truncated to the last 5000 lines.
 Uh have you uh started the menopause as yet.
 Oh yes yeah.
 You have yeah. It it coulda been produced by the um surgery uh obviously I mean you can ha have interference with uh the blood supply to the remaining ovary under these conditions that might have uh faded away a bit earlier that it woulda done under normal conditions.
 Okay. Well the second issue I wanna bring up is um the issue of alternative therapy it actually does work um because I've refused to go on H R T so I've been experimenting for qu well ever since I've been forty-two . Um but I have found that of course it's only short term and it may last for three or four months and then y'know you you look at then you have to look at self medicating. And that's another worry for me is is ha I'm on three alternative therapy medications now. Uh I don't know who to turn to to say yes that's too much or no y'know you shouldn't be doing that because I I don't know if 

## 1.3 Constructing DataFrame

In [308]:
df = pd.DataFrame.from_dict(data=dataset, orient='index')

In [309]:
# resetting index and renaming columns
df_clean = df.reset_index(names='filename')
df_clean.rename(mapper={0: 'transcript'}, axis=1, inplace=True)

## 1.4 General Data Overview

In [310]:
df_clean

Unnamed: 0,filename,transcript
0,ABCNE1-plain.txt,A very good afternoon to you Roly.\n Good aft...
1,ABCE1-plain.txt,Thanks for that John Hall now John Hall will ...
2,ABCE4-plain.txt,Uh blue-tongues'd be unlikely to eat them be...
3,ABCE2-plain.txt,Ah look l Les Pete.\n.\n Simon.\n G'day Peto....
4,ABCE3-plain.txt,If you haven't been with us before this how i...
5,1_Dataset Description.txt,Name\n\nAustralian Radio Talkback\n\nDescripti...
6,COME2-plain.txt,Good morning everyone and welcome to a very f...
7,NAT8-plain.txt,Hello and welcome to the Chatroom with Gaby t...
8,COME5-plain.txt,Here's Sharina's Saturday Nights the positive...
9,COMNE2-plain.txt,Good afternoon Howard Sattler with you welcom...


In [311]:
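# setting aside the dataset description since it's not comparable to other files
description_mask = df_clean['filename'] == '1_Dataset Description.txt'
dataset_description = df_clean.loc[description_mask, 'transcript']
# drop by filename rather than by a hard-coded index, since os.listdir order is not guaranteed
df_clean = df_clean.drop(index=df_clean.index[description_mask])
print(df_clean.filename)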

0     ABCNE1-plain.txt
1      ABCE1-plain.txt
2      ABCE4-plain.txt
3      ABCE2-plain.txt
4      ABCE3-plain.txt
6      COME2-plain.txt
7       NAT8-plain.txt
8      COME5-plain.txt
9     COMNE2-plain.txt
10      NAT7-plain.txt
11     COME7-plain.txt
12      NAT5-plain.txt
13    COMNE1-plain.txt
14      NAT6-plain.txt
15      NAT2-plain.txt
16      NAT4-plain.txt
17    COMNE3-plain.txt
18     COME3-plain.txt
19    ABCNE2-plain.txt
20     COME4-plain.txt
21    COMNE4-plain.txt
22      NAT1-plain.txt
23     COME8-plain.txt
24     COME1-plain.txt
25    COMNE6-plain.txt
26    COMNE7-plain.txt
27    COMNE5-plain.txt
28      NAT3-plain.txt
29     COME6-plain.txt
Name: filename, dtype: object


We will truncate the common "-plain.txt" element in the filenames for easier viewing.

In [312]:
import re

def clean_filename(filename):
  # escape the dot and anchor at the end so only the literal "-plain.txt" suffix is removed
  cleaned_filename = re.sub(r"-plain\.txt$", "", filename)
  return cleaned_filename

df_clean['filename'] = df_clean['filename'].apply(clean_filename)
df_clean['filename']

Unnamed: 0,filename
0,ABCNE1
1,ABCE1
2,ABCE4
3,ABCE2
4,ABCE3
6,COME2
7,NAT8
8,COME5
9,COMNE2
10,NAT7


# 2. Simple Feature Engineering

## 2.1 'Commercial' Flag

Here we create a simple boolean feature flagging whether a transcript comes from a commercial station (commercial == True) or a public station (commercial == False).

In [313]:
df_clean['commercial'] = df_clean['filename'].str.contains("COM")
df_clean['commercial']

Unnamed: 0,commercial
0,False
1,False
2,False
3,False
4,False
6,True
7,False
8,True
9,True
10,False


## 2.2 Region

Now we create a feature that indicates the region in which the programme was broadcast: National, East, or Southwest ('NE': 'Not East').

In [314]:
def determine_region(filename):
  if 'NAT' in filename:
    return 'national'
  elif 'NE' in filename:
    return 'southwest'
  elif 'ABCE' in filename or "COME" in filename:
    return 'east'
  else:
    return 'unknown'

df_clean['region'] = df_clean['filename'].apply(determine_region)
df_clean['region']

Unnamed: 0,region
0,southwest
1,east
2,east
3,east
4,east
6,east
7,national
8,east
9,southwest
10,national
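
As a cross-check on the two filename-derived features above, both can be recovered from one regular expression. This is only a sketch: the helper name `parse_station_code` and the pattern are illustrative assumptions, and it presumes filenames have already been stripped to codes such as `COMNE3`.

```python
import re

# Hypothetical pattern (an assumption, not part of the notebook):
# broadcaster prefix, optional region code, then a recording number.
STATION_CODE = re.compile(r"^(ABC|COM|NAT)(NE|E)?(\d+)$")

def parse_station_code(code):
    """Split a code such as 'COMNE3' into (sector, region, number)."""
    match = STATION_CODE.match(code)
    if match is None:
        return ("unknown", "unknown", None)
    prefix, region_code, number = match.groups()
    # NAT and ABC recordings come from the public broadcaster
    sector = "commercial" if prefix == "COM" else "public"
    if prefix == "NAT":
        region = "national"
    elif region_code == "NE":
        region = "southwest"
    elif region_code == "E":
        region = "east"
    else:
        region = "unknown"
    return (sector, region, int(number))

print(parse_station_code("ABCNE1"))  # ('public', 'southwest', 1)
print(parse_station_code("COME2"))   # ('commercial', 'east', 2)
```

Unlike substring checks, an anchored pattern returns 'unknown' for any filename that does not follow the expected scheme, which makes stray files easier to spot.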


## 2.3 Wordcounts

Here we create a simple wordcount feature to indicate the length of the respective transcripts.

In [315]:
def count_words(text):
  wc = len(text.split())
  return wc

df_clean['wordcount'] = df_clean['transcript'].apply(count_words)
df_clean['wordcount']

Unnamed: 0,wordcount
0,4722
1,9209
2,3694
3,10694
4,6560
6,7340
7,17184
8,18947
9,6791
10,6009


# 3. Preprocessing

## 3.1 Tokenization

Here we <i>tokenize</i> our transcripts into constituent words, contracted elements and punctuation marks, and pass the tokens to a new column 'raw_tokens'. The tokens are 'raw' in that we are not yet filtering out stopwords.

In [316]:
from nltk.tokenize import word_tokenize

df_clean['raw_tokens'] = df_clean['transcript'].apply(word_tokenize)
df_clean['raw_tokens']

Unnamed: 0,raw_tokens
0,"[A, very, good, afternoon, to, you, Roly, ., G..."
1,"[Thanks, for, that, John, Hall, now, John, Hal..."
2,"[Uh, blue-tongues, 'd, be, unlikely, to, eat, ..."
3,"[Ah, look, l, Les, Pete, ., ., Simon, ., G'day..."
4,"[If, you, have, n't, been, with, us, before, t..."
6,"[Good, morning, everyone, and, welcome, to, a,..."
7,"[Hello, and, welcome, to, the, Chatroom, with,..."
8,"[Here, 's, Sharina, 's, Saturday, Nights, the,..."
9,"[Good, afternoon, Howard, Sattler, with, you, ..."
10,"[Five, A, M, in, New, York, hey, ., There, 's,..."


Next we tokenize by sentences, passing the output to the 'sentences' column.

In [317]:
from nltk.tokenize import sent_tokenize

df_clean['sentences'] = df_clean['transcript'].apply(sent_tokenize)
df_clean['sentences']

Unnamed: 0,sentences
0,"[ A very good afternoon to you Roly., Good aft..."
1,[ Thanks for that John Hall now John Hall will...
2,[ Uh blue-tongues'd be unlikely to eat them b...
3,"[ Ah look l Les Pete., ., Simon., G'day Peto.,..."
4,[ If you haven't been with us before this how ...
6,[ Good morning everyone and welcome to a very ...
7,[ Hello and welcome to the Chatroom with Gaby ...
8,[ Here's Sharina's Saturday Nights the positiv...
9,[ Good afternoon Howard Sattler with you welco...
10,"[ Five A M in New York hey., There's gotta be ..."


## 3.2 Stopwords and Filtration

Here we filter out 'stopwords' from our tokens, i.e. function words of little analytic interest. We adapt NLTK's predefined stopword bank.

In [318]:
from nltk.corpus import stopwords

stop_words = stopwords.words("english")
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

We will be interested in the use of pronouns in our analysis, so we remove these from the stopwords bank.

In [319]:
go_words = [
    'i', 'me', 'my', 'myself',
    'we', 'our', 'ours', 'ourselves',
    'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves',
    'he', 'him', 'his', 'himself',
    'she', "she's", 'her', 'hers', 'herself',
    'it', "it's", 'its', 'itself',
    'they', 'them', 'their', 'theirs', 'themselves',
]

In [320]:
stopwords_boutique = []
for word in stop_words:
  if word not in go_words:
    stopwords_boutique.append(word)

stopwords_boutique

['what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 'few',
 'more',
 'most',
 'other',
 'some',
 'such',
 'no',
 'nor',
 'not',
 'only',
 'own',
 'same',
 'so',
 'than',
 'too',
 'very',
 's',
 't',
 'can',
 'will',
 'just',
 'don',
 "don't",
 'should',
 "should've",
 'now',
 'd',
 'll',
 'm',
 'o',
 're',
 've',
 'y',
 'ain',
 'aren',
 "aren't",
 'couldn',
 "couldn't",
 'didn',
 "di

The full stopword bank (stop_words) is extended here with disfluencies and response tokens; stopwords_boutique is left unchanged.

In [321]:
extra_stopwords = [
    'uh',
    'um',
    'yeah',
    'oh'
]

In [322]:
stop_words.extend(extra_stopwords)
stop_words = set(stop_words)

In [323]:
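# Function to filter out unwanted tokens: punctuation, stopwords (including disfluencies),
# and any token containing non-alphabetic characters (e.g. possessive and contraction clitics)
import string
puncts = list(string.punctuation)  # defined here because this cell runs before section 4.1.1

def filter_tokens(raw_tokens):
  filtered_tokens = [t.lower() for t in raw_tokens if
                     t.lower() not in puncts and
                     t.lower() not in stop_words and
                     t.lower().isalpha()]
  return filtered_tokens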

In [324]:
df_clean['filtered_tokens'] = df_clean['raw_tokens'].apply(filter_tokens)
df_clean['filtered_tokens']

Unnamed: 0,filtered_tokens
0,"[good, afternoon, roly, good, afternoon, sir, ..."
1,"[thanks, john, hall, john, hall, listening, ne..."
2,"[unlikely, eat, good, old, hemidactylus, asian..."
3,"[ah, look, l, les, pete, simon, peto, simo, le..."
4,"[us, functions, jurate, sasnaitis, joins, us, ..."
6,"[good, morning, everyone, welcome, foggy, sort..."
7,"[hello, welcome, chatroom, gaby, tonight, worl..."
8,"[sharina, saturday, nights, positive, vibe, go..."
9,"[good, afternoon, howard, sattler, welcome, dr..."
10,"[five, new, york, hey, got, ta, something, sal..."


In [325]:
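# NOTE: rank_word_frequencies is defined in section 4.1.1 below; that cell must be run before this one
stop_words = [word.lower() for word in stop_words]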
df_clean['filtered_frequency_ranks'] = df_clean['filtered_tokens'].apply(lambda tokens: rank_word_frequencies(tokens, stop_words))
df_clean[['filename', 'filtered_frequency_ranks']]

Unnamed: 0,filename,filtered_frequency_ranks
0,ABCNE1,"[(0, one), (1, like), (2, word), (3, okay), (4..."
1,ABCE1,"[(0, well), (1, yes), (2, sort), (3, like), (4..."
2,ABCE4,"[(0, okay), (1, get), (2, bird), (3, good), (4..."
3,ABCE2,"[(0, well), (1, yes), (2, would), (3, got), (4..."
4,ABCE3,"[(0, think), (1, really), (2, book), (3, like)..."
6,COME2,"[(0, well), (1, good), (2, get), (3, think), (..."
7,NAT8,"[(0, people), (1, like), (2, think), (3, well)..."
8,COME5,"[(0, got), (1, like), (2, okay), (3, really), ..."
9,COMNE2,"[(0, say), (1, think), (2, got), (3, would), (..."
10,NAT7,"[(0, like), (1, got), (2, good), (3, well), (4..."


## 3.3 Lemmatization

Here we <i>lemmatize</i> our data, reducing word tokens to their dictionary (base) forms and retaining only nouns, adjectives and verbs.

In [326]:
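# Assumption: the notebook does not show where `nlp` is created; a small English
# spaCy pipeline is loaded here so that the cell can run on its own
nlp = spacy.load("en_core_web_sm")

def lemmatize(tokens, allowed_postags=["NOUN", "ADJ", "VERB"]):
  # rejoin the input tokens into a single text at whitespaces
  text = " ".join(tokens)
  doc = nlp(text)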
  lemmas = []

  for token in doc:
    if token.pos_ in allowed_postags:
      lemmas.append(token.lemma_)

  return lemmas

In [327]:
df_clean['lemmas'] = df_clean['filtered_tokens'].apply(lemmatize)
df_clean['lemmas']

Unnamed: 0,lemmas
0,"[good, afternoon, roly, good, afternoon, sir, ..."
1,"[thank, listen, next, hour, angus, stewart, ta..."
2,"[eat, good, old, asian, tend, stick, wall, goo..."
3,"[l, pete, peto, simo, good, morning, gentleman..."
4,"[function, join, couple, week, month, book, cl..."
6,"[good, morning, welcome, foggy, sort, strange,..."
7,"[welcome, chatroom, tonight, day, talk, refuge..."
8,"[night, positive, vibe, good, evening, good, e..."
9,"[good, afternoon, sattler, drive, program, p, ..."
10,"[get, salt, pervert, super, request, name, tim..."


# 4. Basic Analysis

## 4.1 Word Frequencies

### 4.1.1 Stopword-Permissive

Here we use a narrower bank of stopwords, retaining personal pronouns and disfluencies.

In [328]:
import string
from collections import Counter

puncts = list(string.punctuation)

# this function takes a list of tokens as input and returns a list of tuples of the form (n, word),
# where n is a frequency rank (0 being most frequent and len(list) - 1 being least frequent)
def word_frequencies(tokens):
  wc_dict = Counter(tokens)
  common_words = sorted(wc_dict, key=wc_dict.get, reverse=True)
  cw_not_puncts = [
      w for w in common_words if
      w not in puncts
  ]
  most_frequent_words = list(enumerate(cw_not_puncts))
  return most_frequent_words

In [329]:
import string
from collections import Counter

puncts = list(string.punctuation)

def rank_word_frequencies(tokens, s_words):
  s_words = set(word.lower() for word in s_words)
  words = [w.lower() for w in tokens if
           w.lower() not in puncts
           and w.lower() not in s_words]
  wc_dict = Counter(words)
  by_frequency = sorted(wc_dict, key=wc_dict.get, reverse=True)
  frequency_ranks = list(enumerate(by_frequency))
  return frequency_ranks

df_clean['frequency_ranks'] = df_clean['raw_tokens'].apply(rank_word_frequencies, s_words=stopwords_boutique)
df_clean[['filename', 'frequency_ranks']]

Unnamed: 0,filename,frequency_ranks
0,ABCNE1,"[(0, you), (1, i), (2, uh), (3, it), (4, 's), ..."
1,ABCE1,"[(0, uh), (1, it), (2, i), (3, 's), (4, you), ..."
2,ABCE4,"[(0, it), (1, i), (2, you), (3, 's), (4, they)..."
3,ABCE2,"[(0, it), (1, 's), (2, you), (3, i), (4, uh), ..."
4,ABCE3,"[(0, i), (1, it), (2, 's), (3, you), (4, she),..."
6,COME2,"[(0, 's), (1, it), (2, you), (3, i), (4, uh), ..."
7,NAT8,"[(0, i), (1, 's), (2, it), (3, um), (4, uh), (..."
8,COME5,"[(0, you), (1, i), (2, 's), (3, it), (4, 're),..."
9,COMNE2,"[(0, i), (1, uh), (2, 's), (3, it), (4, you), ..."
10,NAT7,"[(0, you), (1, i), (2, 's), (3, it), (4, uh), ..."


In [330]:
# corpus frequencies
corpus_tokens = np.concatenate(df_clean['raw_tokens'].values).tolist()
corpus_frequency_ranks = rank_word_frequencies(corpus_tokens, s_words=stopwords_boutique)
print("Top 10 most frequent words in whole corpus:")
for rank_word in corpus_frequency_ranks[:10]:  # avoid shadowing the built-in `tuple`
  print(rank_word)

Top 10 most frequent words in whole corpus:
(0, 'i')
(1, 'you')
(2, 'it')
(3, "'s")
(4, 'uh')
(5, 'um')
(6, 'they')
(7, 'we')
(8, "n't")
(9, "'re")


- We can see that the most frequently occurring words are the personal pronouns <b>I</b> and <b>you</b>.
- This finding corroborates that of Monika Bednarek in her 2014 paper 'Involvement in Australian Talkback Radio--A Corpus Linguistic Investigation', published in the <i>Australian Journal of Linguistics</i>, 34:1, 3-23, DOI: https://www.tandfonline.com/action/showCitFormats?doi=10.1080/07268602.2014.875453.
In her paper, Bednarek finds a relatively high frequency of personal pronouns 'you' and 'I', which she analyzes as "markers of involvement". These markers indicate a situation of immediacy between speakers typical of face-to-face interactions. Bednarek finds Talkback Radio corpora of interest because they represent speech situations which are intermediate between formal written speech and casual face-to-face interactions, what "McCarthy and O'Keeffe (2003) call ... a 'pseudo-conversational context'" (quoted in Bednarek, p.5).
- We also observe frequent occurrences of <i>disfluencies</i> like 'uh'. This is an unsurprising consequence of talkback radio's being "frequently unplanned, improvised or 'ad hoc'" (Hutchby 1996: 59, quoted in Bednarek 2014: 4-5).
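
The top-10 list above also contains the clitics 's, n't and 're. These survive filtering because the tokenizer emits them with their apostrophes, whereas the stopword bank lists bare forms such as 's' and 're'. One possible way to screen them out is sketched here; the `CLITICS` set and the helper `drop_clitics` are illustrative assumptions, and the set is not exhaustive.

```python
# Assumption: an illustrative, non-exhaustive set of tokenizer-produced clitics.
CLITICS = {"'s", "n't", "'re", "'ve", "'ll", "'d", "'m"}

def drop_clitics(tokens):
    """Return the tokens with clitics removed (case-insensitive comparison)."""
    return [t for t in tokens if t.lower() not in CLITICS]

sample = ["It", "'s", "n't", "what", "you", "'re", "after"]
print(drop_clitics(sample))  # ['It', 'what', 'you', 'after']
```

Whether to drop clitics depends on the question being asked: negation ("n't") may itself be of analytic interest and could be kept.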

### 4.1.2 Stopword-Restrictive

Now we will recalculate word frequencies with a more restrictive bank of stopwords, in the hope of finding clearer patterns.

In [331]:
filtered_corpus_tokens = np.concatenate(df_clean['filtered_tokens'].values).tolist()
filtered_corpus_frequency_ranks = rank_word_frequencies(filtered_corpus_tokens, s_words=stop_words)
print("Top 10 most frequent words in whole corpus:")
for rank_word in filtered_corpus_frequency_ranks[:10]:  # avoid shadowing the built-in `tuple`
  print(rank_word)

Top 10 most frequent words in whole corpus:
(0, 'well')
(1, 'like')
(2, 'think')
(3, 'got')
(4, 'good')
(5, 'one')
(6, 'get')
(7, 'yes')
(8, 'would')
(9, 'know')


Following filtration, we can start to observe a more interesting variation in word frequencies.
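
Ranks alone hide how far apart neighbouring words are in frequency. A minimal sketch of reporting raw counts alongside ranks is given here; the function name `ranked_counts` and the toy token list are assumptions for illustration, not part of the notebook.

```python
from collections import Counter

def ranked_counts(tokens, stopwords=frozenset(), top_n=10):
    """Return (rank, word, count) triples for the top_n most frequent non-stopword tokens."""
    counts = Counter(t.lower() for t in tokens if t.lower() not in stopwords)
    return [(rank, word, n) for rank, (word, n) in enumerate(counts.most_common(top_n))]

# Toy data, not drawn from the corpus
toy = ["well", "like", "well", "think", "the", "well", "like"]
for row in ranked_counts(toy, stopwords={"the"}):
    print(row)
# (0, 'well', 3)
# (1, 'like', 2)
# (2, 'think', 1)
```

Applied to the filtered corpus tokens, the counts would show whether 'well' leads 'like' by a wide margin or a narrow one.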

## 4.2 Sentence Lengths

In [332]:
def measure_sentences(sentences):
  # len(sentence) counts characters, so the average is measured in characters, not words
  sentence_lengths = [len(sentence) for sentence in sentences]
  avg_sentence_length = int(np.mean(sentence_lengths).round())
  return avg_sentence_length

In [333]:
avg_sentence_lengths = df_clean['sentences'].apply(measure_sentences)
df_clean.insert(7, 'avg_len_sentences', avg_sentence_lengths)
df_clean[['filename', 'avg_len_sentences']]

Unnamed: 0,filename,avg_len_sentences
0,ABCNE1,85
1,ABCE1,80
2,ABCE4,103
3,ABCE2,59
4,ABCE3,88
6,COME2,59
7,NAT8,106
8,COME5,63
9,COMNE2,134
10,NAT7,57


In [334]:
df_clean.head()

Unnamed: 0,filename,transcript,commercial,region,wordcount,raw_tokens,sentences,avg_len_sentences,filtered_tokens,filtered_frequency_ranks,lemmas,frequency_ranks
0,ABCNE1,A very good afternoon to you Roly.\n Good aft...,False,southwest,4722,"[A, very, good, afternoon, to, you, Roly, ., G...","[ A very good afternoon to you Roly., Good aft...",85,"[good, afternoon, roly, good, afternoon, sir, ...","[(0, one), (1, like), (2, word), (3, okay), (4...","[good, afternoon, roly, good, afternoon, sir, ...","[(0, you), (1, i), (2, uh), (3, it), (4, 's), ..."
1,ABCE1,Thanks for that John Hall now John Hall will ...,False,east,9209,"[Thanks, for, that, John, Hall, now, John, Hal...",[ Thanks for that John Hall now John Hall will...,80,"[thanks, john, hall, john, hall, listening, ne...","[(0, well), (1, yes), (2, sort), (3, like), (4...","[thank, listen, next, hour, angus, stewart, ta...","[(0, uh), (1, it), (2, i), (3, 's), (4, you), ..."
2,ABCE4,Uh blue-tongues'd be unlikely to eat them be...,False,east,3694,"[Uh, blue-tongues, 'd, be, unlikely, to, eat, ...",[ Uh blue-tongues'd be unlikely to eat them b...,103,"[unlikely, eat, good, old, hemidactylus, asian...","[(0, okay), (1, get), (2, bird), (3, good), (4...","[eat, good, old, asian, tend, stick, wall, goo...","[(0, it), (1, i), (2, you), (3, 's), (4, they)..."
3,ABCE2,Ah look l Les Pete.\n.\n Simon.\n G'day Peto....,False,east,10694,"[Ah, look, l, Les, Pete, ., ., Simon, ., G'day...","[ Ah look l Les Pete., ., Simon., G'day Peto.,...",59,"[ah, look, l, les, pete, simon, peto, simo, le...","[(0, well), (1, yes), (2, would), (3, got), (4...","[l, pete, peto, simo, good, morning, gentleman...","[(0, it), (1, 's), (2, you), (3, i), (4, uh), ..."
4,ABCE3,If you haven't been with us before this how i...,False,east,6560,"[If, you, have, n't, been, with, us, before, t...",[ If you haven't been with us before this how ...,88,"[us, functions, jurate, sasnaitis, joins, us, ...","[(0, think), (1, really), (2, book), (3, like)...","[function, join, couple, week, month, book, cl...","[(0, i), (1, it), (2, 's), (3, you), (4, she),..."


In [335]:
avg_len_sentences_by_com = df_clean.groupby('commercial')['avg_len_sentences'].mean().round()
print("Average length of sentences: commercial vs. public")
print(avg_len_sentences_by_com)

Average length of sentences: commercial vs. public
commercial
False     95.0
True     102.0
Name: avg_len_sentences, dtype: float64


We do not observe a substantial difference in average sentence length, measured in characters, between transcripts from commercial stations (102) and public stations (95).
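
Because measure_sentences applies len() to each sentence string, the figures above are character counts. A word-based variant might look like the sketch below; the function name `avg_sentence_length_words` is hypothetical, and words are approximated by whitespace splitting rather than by the NLTK tokenizer.

```python
def avg_sentence_length_words(sentences):
    """Mean number of whitespace-separated words per non-empty sentence, via built-in round()."""
    lengths = [len(s.split()) for s in sentences if s.split()]
    if not lengths:
        return 0
    return round(sum(lengths) / len(lengths))

# Toy data, not drawn from the corpus
print(avg_sentence_length_words(["Good afternoon sir.", "Yes."]))  # 2
```

Word-based lengths are easier to compare with figures reported elsewhere, since character counts also reflect average word length.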

# 5. Concordance Analysis

To assess the significance of frequently occurring terms, we can conduct a concordance analysis to look at a representative sample of where these words occur in the transcripts.
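
To make the idea concrete, a keyword-in-context (KWIC) concordance can be sketched in plain Python. The function name `kwic` and its token-based `window` parameter are illustrative assumptions; NLTK's own concordance, used in the cells that follow, measures context by character width instead.

```python
def kwic(tokens, target, window=4):
    """Return (left, keyword, right) context triples for each occurrence of target."""
    target = target.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

# Toy token list modelled loosely on the transcripts
toy = "Yeah well we 're on top of the table as well .".split()
for left, kw, right in kwic(toy, "well", window=3):
    print(f"{left:>15} [{kw}] {right}")
```

Each hit keeps the keyword separate from its left and right context, which makes it straightforward to tabulate the neighbouring words later.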

In [336]:
from nltk.text import Text

corpus_tokens = [t for sublist in df_clean['raw_tokens'].tolist() for t in sublist]
corpus_text = Text(corpus_tokens)

In [337]:
def list_concords(tokens, words, width=79, lines=5, p=True):
  # build an NLTK Text and collect up to `lines` concordance lines of `width` characters each
  text = Text(tokens)
  con_list = text.concordance_list(words, width=width, lines=lines)
  if p:
    print(f"{lines} concordance examples for {words}")
  concordances = []
  for i, con in enumerate(con_list, start=1):
    if p:
      print(i, ":  ", con.line)
    concordances.append(con.line)
  return concordances

### 5.2.1 'Well'

In [338]:
well_concordances = list_concords(corpus_tokens, 'well', lines=10, width=200)

10 concordance examples for well
1 :   o to Tasmania taking things to W A in the cricket . That 's heading in the right direction . Yeah well we 're on top of the table at the moment . We 've we had two outright victories so far which is a
2 :    not the sort of thing that you would find in any other sport nowadays . No there 's certainly uh well al although cricket 's losing that a little too but there 's certainly that gentile um factor tha
3 :   that 's the link in my case . Depends on the value of salt which actually is part of the story as well . Okay . Ah last week someone asked about a Scouse phrase okeone or O K E O N E. Absolutely no tr
4 :    and no one is entirely certain so I 'm a I 'm afraid I 've got a an inconclusive answer there as well but that 's how it came about they reckon . I now have a question about tillies . Tillies . Mm uh
5 :   ulation . Possibly but I n I 've got some references to tilly from Dalby in the thirties which is well before the Brits started . So 

We can roughly ascertain some common usages for the word 'well':
1. "Yeah well". This collocation represents a way for a speaker to signal agreement while adding additional information: "... we're on top of the table at the moment ... "
2. "uh well". This collocation seems to introduce a caveat or nuance, while also functioning as a filler alongside "uh": "No there's certainly uh well although ... "
3. "as well". A conventional usage meaning "also".
4. "as well", as in (3).
5. "well before", an intensifier.
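
The collocations read off the concordance lines above can also be counted directly. The sketch below tallies the word immediately to the left of a target; the function name `left_collocates` and the toy token list are assumptions for illustration.

```python
from collections import Counter

def left_collocates(tokens, target, top_n=5):
    """Count the words that immediately precede target and return the top_n as (word, count) pairs."""
    target = target.lower()
    lowered = [t.lower() for t in tokens]
    pairs = Counter(
        lowered[i - 1] for i in range(1, len(lowered)) if lowered[i] == target
    )
    return pairs.most_common(top_n)

# Toy data, not drawn from the corpus
toy = "yeah well as well uh well as well".split()
print(left_collocates(toy, "well"))  # [('as', 2), ('yeah', 1), ('uh', 1)]
```

Run over the raw corpus tokens, a tally like this would indicate how much of the frequency of 'well' is accounted for by the conventional "as well" rather than by discourse-marker uses.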


- Bednarek (2014) cites Chafe (1982) in classing <i>well</i> as a feature of linguistic involvement "monitoring ... information flow". (6) above most clearly resembles this case: the speaker here is acknowledging their comprehension of their interlocutor's previous affirmation ("oh definitely yes") and the phrase performs this 'flow' function in transitioning to the next thought: "the name golf originally...".
- (7), (9) and (10) likewise perform this function of 'monitoring information flow'. We can note that the word seems to occur especially often during dialogue between two conversants, which lends credence to Chafe's categorization.

### 5.2.2 'Like'

In [339]:
like_concordances = list_concords(corpus_tokens, 'like', lines=10, width=200)

10 concordance examples for like
1 :   the in an entire game so . Whereas cricket has a certain measured elegance and then you get names like remember John Arlott and and Alan McGilvray and the great commentators of the past uh I think uh 
2 :   all . Absolutely lovely do n't mind if I do . Alright have you got a teaser for us . Yeah you 'll like got you a stack of them actually . Some of these are quite easy but they 're interesting if you d
3 :   n-tens uh Jonathan Green has it in the Cassells Dictionary of Slang and it means to treat someone like a fool . Um the pulling someone 's leg is a nineteenth century story and there are all sorts of v
4 :   n the past for utility do they still do so and can they give us place and time because this looks like an odd distribution . Certainly ute is winning hands down nowadays but the tilly has been around 
5 :   ish yes I have played yes I I not regularly but I have yes . On a scale of one to ten that sounds like about two point seven . Yes it

<ol> a.  Most examples here use <i>like</i> in the sense of 'such as', 'similar to'; this is the case for (1), (3), (4), (5), (8) and (9). </ol>
<ol> b. (2), (6) and (7) are instances of <i>like</i> used as a verb in the sense of 'approve', 'prefer'. </ol>
<ol> c. Chafe classes <i>like</i> as an example of fuzziness, vague language or hedging. (10) is a clear example of this usage. The pragmatic use of the word as a hedge cannot always be cleanly separated from its semantic sense as defined in (a), since the 'similar to' sense can itself diminish the degree of certainty the speaker expresses. However, we can generally judge whether <i>like</i> is functioning as a hedge by removing it: if the meaning of the utterance changes significantly, it is not a hedge; if the only change is a heightened degree of certainty, it is. </ol>

### 5.2.3 'Think'

In [340]:
think_concords = list_concords(corpus_tokens, 'think')

5 concordance examples for think
1 :    great commentators of the past uh I think uh allows one to appreciate things o
2 :    mine 's worth about a pound of it I think that 's the link in my case . Depend
3 :   owf C O W F meaning club and so they think that that was where the the name fro
4 :    children as their their sprogs uh I think that it 's parti uh peculiarly Austr
5 :   um there is a dialectal word sprag I think which means lively young man sort of


- In contrast to the previous two examples, all the usages of 'think' here are syntactically equivalent: each is a present-tense verb following a pronoun. Usage is thus less flexible than for 'well' or 'like'.
- Bednarek cites the work of Chafe in categorizing "I think" as a form of "ego involvement", understood as "involvement of the speaker with himself [sic]" (Chafe 1985: 116, quoted in Bednarek 2014: 8), or "references to the speaker's mental processes" (Bednarek 2014: 8). Bednarek (2014) points out that "There are still some problematic issues with this three-fold categorization, for example ... phrases such as <i>I think</i>, which ostensibly refer to the speaker's mental process, may function pragmatically as hedges." (9).
- Bednarek is perhaps unfair here in her criticism of Chafe's categorization, since a hedge is arguably still a reference to the "speaker's mental processes", in this case the mental state of uncertainty.
- To assess this question of hedging, we need to expand the context a little.

<h3>'I think' therefore I hedge?</h3>
<b>How do we assess whether a usage of 'I think' is fulfilling a 'hedging' function?</b>
<br>
A plausible test would be to simply remove the phrase and evaluate whether the resulting utterance has changed its semantic meaning. If the phrase is used as a hedge, we would expect the resulting utterance to be semantically near-identical but to express a greater degree of confidence. For example, in a sentence like "I think the train is running late", removing the phrase produces "The train is running late", expressing the same content with a greater degree of confidence. However, in a phrase like "I think therefore I am", removing the phrase produces the nonsensical "therefore I am". Let's run our test and see what we find.

In [341]:
I_think_concords = list_concords(corpus_tokens, ["I", "think"], width=200, lines=10)

10 concordance examples for ['I', 'think']
1 :   ames like remember John Arlott and and Alan McGilvray and the great commentators of the past uh I think uh allows one to appreciate things over several days which is not the sort of thing that you wo
2 :   ry and salt . Mm okay salary and salt alright the link mkay m mine 's worth about a pound of it I think that 's the link in my case . Depends on the value of salt which actually is part of the story 
3 :    were called spoog or sproggy and they used to refer to their children as their their sprogs uh I think that it 's parti uh peculiarly Australian it 's as I I have n't uh come across it despite inqui
4 :    friends at school called sproggy . Yeah I remember that too um there is a dialectal word sprag I think which means lively young man sort of something like that uh but I have n't heard that for y for
5 :   nd what can you tell us what a tilly van was like . Well it was just like a small bus really um I think it seats about oh twent

In [342]:
unhedged = {}
for idx, c in enumerate(I_think_concords):
  # \b word boundaries stop 'uh'/'um' being stripped out of longer words (e.g. 'museum')
  c = re.sub(r"\b(?:I think|uh|um)\b", "", c)
  unhedged[idx] = c
  print(idx+1, ": ", c)

1 :  ames like remember John Arlott and and Alan McGilvray and the great commentators of the past    allows one to appreciate things over several days which is not the sort of thing that you wo
2 :  ry and salt . Mm okay salary and salt alright the link mkay m mine 's worth about a pound of it  that 's the link in my case . Depends on the value of salt which actually is part of the story 
3 :   were called spoog or sproggy and they used to refer to their children as their their sprogs   that it 's parti  peculiarly Australian it 's as I I have n't  come across it despite inqui
4 :   friends at school called sproggy . Yeah I remember that too  there is a dialectal word sprag  which means lively young man sort of something like that  but I have n't heard that for y for
5 :  nd what can you tell us what a tilly van was like . Well it was just like a small bus really   it seats about oh twenty people twentyish people maybe thirty but it was like a small bus . And
6 :   it . Uh well Gamecoc

If we look at utterance 1, we can see an issue: removing the phrase can render the utterance ungrammatical, even though the sentence would be grammatical with a slight morphemic change. "The great commentators of the past allows one" is ungrammatical, but "The great commentators of the past allow one" is perfectly grammatical. Therefore, before proceeding further, we will try to lemmatize the resulting utterances.

In [343]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

In [344]:
unhedged_docs = {}

for idx, c in unhedged.items():
  doc = nlp(c)
  unhedged_docs[idx] = doc

I_think_lemmas = {}
for idx, doc in unhedged_docs.items():
    I_think_lemmas[idx] = [token.lemma_ for token in doc]

In [345]:
# rejoin lemma lists to strings
lemmatized_utterances = {}
for key, value in I_think_lemmas.items():
  lemmatized_utterances[key] = " ".join(value)
lemmatized_utterances

{0: 'ame like remember John Arlott and and Alan McGilvray and the great commentator of the past     allow one to appreciate thing over several day which be not the sort of thing that you will',
 1: 'ry and salt . mm okay salary and salt alright the link mkay m mine be worth about a pound of it   that be the link in my case . depend on the value of salt which actually be part of the story',
 2: '  be call spoog or sproggy and they use to refer to their child as their their sprog    that it be parti   peculiarly australian it be as I I have not   come across it despite inqui',
 3: '  friend at school call sproggy . yeah I remember that too   there be a dialectal word sprag   which mean lively young man sort of something like that   but I have not hear that for y for',
 4: 'nd what can you tell we what a tilly van be like . well it be just like a small bus really    it seat about oh twenty people twentyish people maybe thirty but it be like a small bus . and',
 5: '  it . uh well Gamecock

- Removing the phrase in <b>1</b> clearly changes the meaning and makes the expression far less fluent. The phrase here was clearly serving an important function in conjoining the first part of the utterance with the second. This usage seemed to allow the speaker a way to introduce a connected thought without needing to precisely formulate the connection between the two clauses in a way that might be expected in written English.
- Looking at the utterances as a whole seems to lend weight to our conjecture that there is no hard-and-fast distinction between the usage of "I think" as a feature of Ego-involvement, a reference to the speaker's mental process, and its pragmatic function as a hedge. All the examples could be interpreted as both of these at once; they hedge precisely <i>by</i> making reference to the speaker's mental process. If we imagine what the usage could be as a "pure" marker of Ego-involvement, without any hedging function, we could imagine an expression like "When I look at the night sky, I think of the smallness of our place in the universe". None of the concordances fulfill this function of pure Ego-involvement.

# 6. Collocations

We have looked at frequently occurring terms, or 1-grams, in isolation, and at the 2-gram "I think". Now we will analyze frequently occurring n-grams for n>1.

In [346]:
from collections import Counter
from nltk import ngrams

def get_top_ngrams(tokens, n=2, top_k=10):
  """Return the top_k most frequent n-grams in tokens as (ngram, count) pairs."""
  n_gram_freqs = Counter(ngrams(tokens, n))
  return n_gram_freqs.most_common(top_k)

## 6.1 Corpus-wide frequent n-grams

In [347]:
corpus_2grams = get_top_ngrams(filtered_corpus_tokens)
print("Top 10 most frequently occurring 2-grams in corpus:\n")
print("frequency   2-gram\n")
for gram in corpus_2grams:
  print(gram[1], "      ", gram[0])

Top 10 most frequently occurring 2-grams in corpus:

frequency   2-gram

194        ('good', 'morning')
170        ('wan', 'na')
159        ('little', 'bit')
141        ('got', 'ta')
134        ('thank', 'much')
72        ('would', 'like')
64        ('years', 'ago')
58        ('yes', 'yes')
55        ('let', 'go')
53        ('b', 'c')


In [348]:
# apply the function to each transcript's filtered tokens
df_clean['2grams'] = df_clean['filtered_tokens'].apply(get_top_ngrams)
print(df_clean[['filename','2grams']])

   filename                                             2grams
0    ABCNE1  [((one, word), 9), ((good, afternoon), 7), ((b...
1     ABCE1  [((open, garden), 8), ((b, c), 8), ((c, sydney...
2     ABCE4  [((tree, snake), 6), ((queensland, museum), 5)...
3     ABCE2  [((b, c), 14), ((got, ta), 8), ((steel, wool),...
4     ABCE3  [((book, club), 9), ((lovely, bones), 9), ((fo...
6     COME2  [((seven, hills), 19), ((wan, na), 12), ((real...
7      NAT8  [((wan, na), 17), ((refugee, day), 15), ((mand...
8     COME5  [((next, year), 19), ((little, bit), 18), ((so...
9    COMNE2  [((got, ta), 16), ((wan, na), 16), ((four, eig...
10     NAT7  [((super, request), 12), ((got, ta), 6), ((nex...
11    COME7  [((dr, sally), 6), ((sally, cockburn), 6), ((y...
12     NAT5  [((christian, values), 10), ((jesus, would), 1...
13   COMNE1  [((twenty, twenty), 14), ((wan, na), 12), ((go...
14     NAT6  [((dr, karl), 16), ((good, morning), 11), ((go...
15     NAT2  [((tim, winton), 9), ((thank, much), 9), (

In [349]:
pd.set_option('display.max_columns', None)
gram_data = {}
for _, row in df_clean.iterrows():
  filename = row['filename']
  two_grams = row['2grams']
  gram_data[filename] = [f"{gram[0]}: {gram[1]}" for gram in two_grams]
df_2grams = pd.DataFrame.from_dict(gram_data, orient='index')
df_2grams = df_2grams.sort_index()
df_2grams

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
ABCE1,"('open', 'garden'): 8","('b', 'c'): 8","('c', 'sydney'): 8","('something', 'like'): 8","('sort', 'thing'): 7","('thank', 'much'): 7","('yes', 'yes'): 7","('root', 'system'): 7","('cabbage', 'tree'): 7","('jeanne', 'villani'): 6"
ABCE2,"('b', 'c'): 14","('got', 'ta'): 8","('steel', 'wool'): 8","('wan', 'na'): 7","('two', 'pot'): 7","('thank', 'much'): 7","('c', 'sydney'): 7","('building', 'consultant'): 7","('may', 'well'): 6","('central', 'coast'): 6"
ABCE3,"('book', 'club'): 9","('lovely', 'bones'): 9","('fourteen', 'year'): 7","('year', 'old'): 7","('mr', 'harvey'): 6","('alice', 'sebold'): 6","('ever', 'read'): 5","('good', 'afternoon'): 4","('susie', 'salmon'): 4","('eight', 'years'): 4"
ABCE4,"('tree', 'snake'): 6","('queensland', 'museum'): 5","('carpet', 'python'): 5","('asian', 'house'): 4","('house', 'gecko'): 4","('hi', 'kelly'): 4","('b', 'c'): 4","('would', 'say'): 3","('brown', 'tree'): 3","('green', 'tree'): 3"
ABCNE1,"('one', 'word'): 9","('good', 'afternoon'): 7","('b', 'c'): 5","('little', 'bit'): 5","('late', 'fifties'): 5","('okay', 'thank'): 4","('tilly', 'van'): 4","('w', 'w'): 4","('bad', 'air'): 4","('mm', 'okay'): 3"
ABCNE2,"('little', 'bit'): 10","('good', 'afternoon'): 9","('green', 'leaves'): 7","('time', 'year'): 6","('nelly', 'kelly'): 6","('last', 'year'): 6","('blood', 'bone'): 5","('greg', 'kerrin'): 4","('yes', 'yes'): 4","('okay', 'alright'): 4"
COME1,"('good', 'morning'): 42","('little', 'bit'): 22","('garden', 'clinic'): 16","('two', 'g'): 11","('g', 'b'): 10","('wan', 'na'): 9","('morning', 'good'): 9","('thank', 'much'): 9","('fruit', 'fly'): 9","('let', 'go'): 8"
COME2,"('seven', 'hills'): 19","('wan', 'na'): 12","('real', 'estate'): 11","('three', 'years'): 9","('good', 'morning'): 8","('good', 'cause'): 7","('house', 'hearts'): 6","('children', 'hospital'): 6","('two', 'g'): 6","('well', 'done'): 6"
COME3,"('good', 'morning'): 43","('dr', 'graham'): 18","('morning', 'good'): 15","('erectile', 'dysfunction'): 14","('two', 'g'): 11","('g', 'p'): 11","('little', 'bit'): 11","('morning', 'dr'): 9","('years', 'ago'): 8","('polymyalgia', 'rheumatica'): 8"
COME4,"('good', 'morning'): 17","('got', 'ta'): 15","('well', 'well'): 13","('ovarian', 'cancer'): 12","('morning', 'john'): 10","('okay', 'well'): 10","('dawn', 'fraser'): 8","('know', 'know'): 7","('think', 'well'): 6","('lot', 'people'): 5"


We can roughly divide the most frequent 2-grams into four categories:
1. <b>Topic Indicators</b>: such as 'lawyer bob', 'year twelve', 'hormonal therapy', 'tim winton' and 'australian children'. These frequent n-grams give us an idea of the topic under discussion in that particular broadcast. For example, we can reasonably assume that COMNE5 transcribes a discussion about a legal issue, while NAT2 transcribes a discussion about Australian author Tim Winton.
2. <b>Reflexives</b>: These 2-grams refer to a core aspect of the program itself, as opposed to the examples in <b>1</b>, which refer to topics external to the program. Examples include 'book club', 'super request', 'b c' (presumably preceded by 'a' for 'ABC'), 'two g' and 'g b' for the station 2GB, and 'Dr Karl' (a regular co-host of the Triple J segment <i>Science with Dr. Karl</i>: https://www.abc.net.au/triplej/programs/dr-karl-podcast).
3. <b>Common Phrases</b>: There are also frequently occurring common phrases like the greeting <i>good morning</i>. The frequency of this phrase in particular is likely due to the program's being a call-in show, such that the phrase is repeated whenever a new caller comes on the line. We can see that 'good morning' is the most frequent 2-gram for 5 of the 15 commercial broadcasts (one third), but most frequent for only 1 of the 14 public broadcasts (NAT3), and 2nd most frequent for another (NAT6, which is a transcription of <i>Science with Dr. Karl</i>, a call-in segment). <i>Good afternoon</i> is the 2nd most frequent 2-gram for COMNE7, ABCNE1 and ABCNE2. Bednarek (2014) classifies <i>good morning</i> as a form of reply to a questioner's query, generally the caller's response to the host's asking a question like "how are you?", and points out that "in these talkback data, a positive and friendly atmosphere appears to be created through phatic communication".
4. Finally, there are frequent 2-grams, such as 'one word', which remain <b>ambiguous without context</b>; we accordingly provide that context below.

Concordances: <i>"one word"</i>

In [350]:
ABCNE1_tokens = np.ravel(list(df_clean['raw_tokens'].loc[df_clean['filename'] == 'ABCNE1'].values))
list_concords(ABCNE1_tokens, ["one", "word"])

5 concordance examples for ['one', 'word']
1 :   k we 've got antemeridian which is one word but post-meridian which is hyphena
2 :   of rain then it will be written as one word . But lots of others are hyphenate
3 :    stand by its own but agribusiness one word is okay . There are a couple of od
4 :   x lower case and upper case in the one word . Mm But then you got co and ex . 
5 :   ords are old like coexist they 're one word . If they 're newer like co-author


["k we 've got antemeridian which is one word but post-meridian which is hyphena",
 'of rain then it will be written as one word . But lots of others are hyphenate',
 ' stand by its own but agribusiness one word is okay . There are a couple of od',
 'x lower case and upper case in the one word . Mm But then you got co and ex . ',
 "ords are old like coexist they 're one word . If they 're newer like co-author"]

Concordance analysis suggests that the 2-gram 'one word' in ABCNE1 is a topic indicator, and that the topic under discussion is language itself. We can try to learn more by conducting Named Entity Recognition (NER) on the transcript.

In [351]:
nlp = spacy.load("en_core_web_sm")

In [352]:
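# select the transcript by filename rather than relying on its row position
ABCNE1_text = df_clean.loc[df_clean['filename'] == 'ABCNE1', 'transcript'].iloc[0]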
ABCNE1_doc = nlp(ABCNE1_text)

# store entities and their labels in a dictionary structure
ents_dict = {}
for entity in ABCNE1_doc.ents:
  label = entity.label_
  text = entity.text
  if label not in ents_dict:
    ents_dict[label] = []
  ents_dict[label].append(text)

# print a random sampling from each label
import random
for label, entities in ents_dict.items():
  if len(entities) > 5:
    print(label, ":", random.sample(population=entities, k=5))
  else:
    print(label, ":", entities)

ORG : ['W A', 'Mornington', 'Tilley', 'Midway Point', 'Roly']
TIME : ['this afternoon', 'this afternoon', 'afternoon', 'afternoon', 'afternoon']
GPE : ['Tasmania', 'Ayrshire', 'Oseania', 'America', 'Melbourne']
CARDINAL : ['about eighteen-ninety-one', 'two', 'two', 'two', 'half']
NORP : ['Americans', 'British', 'British', 'French', 'Dutch']
PERSON : ['Roly Sussex', 'Hobart', 'Singer', 'Leon Carol', 'Alan']
DATE : ['today', 'fourteen-fifty-seven', 'the thirties', 'nine-three-six', 'nine-three-six']
PRODUCT : ['G O W F', 'O K E O N E.', 'Punch', 'F B T', 'F B']
EVENT : ['the Cassells Dictionary of Slang']
ORDINAL : ['first', 'nineteenth', 'first', 'second', 'second']
LANGUAGE : ['English', 'English', 'Latin']
LOC : ['Sandy Bay', 'Garden River', 'the South Coast', 'Roly']
FAC : ['Rose Bay', 'Portsmouth Tasmania']
WORK_OF_ART : ['The Twa Dogs']


The results of the NER analysis above suggest that the program did indeed feature some kind of lexical discussion, as indicated by entities like 'the Cassells Dictionary of Slang', 'English' and 'Latin' (the last likely relevant to word etymology).

## 6.2 Collocations measured by Pointwise Mutual Information (PMI) score

Besides finding collocations through n-grams ranked by raw frequency of occurrence, we can also use statistical measures like the Pointwise Mutual Information (PMI) score to find terms which not only occur together frequently in absolute terms, but occur together frequently relative to how often each occurs independently of the other.
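As a rough illustration of the idea (a sketch, not the notebook's own code), PMI for a bigram can be computed by hand as the base-2 logarithm of the ratio between the observed joint probability and the probability expected if the two words were independent. The function name `bigram_pmi` and all counts below are invented for illustration, and probabilities are simply estimated as counts divided by the total token count.

```python
import math

def bigram_pmi(pair_count, count_x, count_y, total_tokens):
    """PMI = log2( P(x, y) / (P(x) * P(y)) ), estimated from raw counts."""
    p_xy = pair_count / total_tokens
    p_x = count_x / total_tokens
    p_y = count_y / total_tokens
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts in a 100,000-token corpus (illustrative only).
total = 100_000

# A frequent pair whose words also occur often on their own.
frequent_pair = bigram_pmi(pair_count=200, count_x=1_000, count_y=400, total_tokens=total)

# A rare pair whose words never occur apart from each other.
exclusive_pair = bigram_pmi(pair_count=3, count_x=3, count_y=3, total_tokens=total)

print(round(frequent_pair, 2), round(exclusive_pair, 2))
```

Under these assumed counts the rare but exclusive pair scores far higher than the frequent pair, which is why PMI rankings tend to surface proper names and technical terms rather than everyday greetings.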

### 6.2.1 Corpus-wide

In [353]:
# Here we are creating a spaCy doc for the whole corpus of filtered tokens, in order to perform NER
filtered_corpus = " ".join(filtered_corpus_tokens)
filtered_corpus_doc = nlp(filtered_corpus)

#### 6.2.1.1 PMI Bigrams

In [354]:
from nltk.collocations import *

corpus_bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(filtered_corpus_tokens)
# filter out bigrams which occur fewer than three times
finder.apply_freq_filter(3)
pmi_corpus_bigrams_top20 = finder.nbest(corpus_bigram_measures.pmi, 20)

print("Corpus bigrams: top 20 most collocated by PMI (n>2)\n")
for idx, bigram in enumerate(pmi_corpus_bigrams_top20):
  print(idx+1, " ".join(bigram))

Corpus bigrams: top 20 most collocated by PMI (n>2)

1 deca durabolin
2 golda meir
3 infantile paralysis
4 port lincoln
5 sebaceous cysts
6 feng shui
7 frenzal rhomb
8 jung chang
9 pride madeira
10 coral reefs
11 grondslag beslaan
12 mallee ringneck
13 africa ruling
14 asylum seekers
15 formerly banned
16 nasser hussain
17 tassel fern
18 irish airways
19 naomi watts
20 nutrient agar


We can divide these collocations into 8 main categories:
1. <b>Names of people</b>, like Golda Meir, Naomi Watts, Jung Chang and Nasser Hussain.
2. <b>Place names</b>, like Port Lincoln.
3. <b>Scientific terms</b>, like the common species names Mallee Ringneck (an Australian bird species) and Tassel Fern; laboratory terms like 'nutrient agar' (a bacterial growth medium); medical terms like 'sebaceous cysts' and 'infantile paralysis'; and ecological terms like 'coral reefs'.
4. <b>Group names</b>, like Frenzal Rhomb (a rock band).
5. <b>Brand names</b>, like 'Deca-Durabolin' and 'Irish Airways'.
6. <b>Loan phrases</b>, like 'feng shui', which are only comprehensible in English as collocations (i.e. neither 'feng' nor 'shui' is an English word on its own).
7. <b>Descriptors of type or status</b>, like 'asylum seekers'.
8. Finally we have what might be called <b>pure collocations</b>, meaning those collocations which are not compound terms, and have no necessary internal connection, but occur together frequently. For these we have 'Africa ruling' and 'formerly banned'.
(9.) 'grondslag beslaan' is incomprehensible to this author, so we conduct a quick concordance analysis below to aid interpretation.

In [355]:
grondslag = list_concords(corpus_tokens, ['grondslag', 'beslaan'], width=100)

5 concordance examples for ['grondslag', 'beslaan']
1 :    I done here yeah what she was after was grondslag beslaan . That 's ground cover in Dutch . Gronds
2 :   eslaan . That 's ground cover in Dutch . Grondslag beslaan . Uh grondslag beslaan . Beslaan . I 'll
3 :    cover in Dutch . Grondslag beslaan . Uh grondslag beslaan . Beslaan . I 'll remember that not . Ye


This is a highly obscure Dutch phrase, which meets the frequency threshold of 3 only because its meaning had to be clarified between caller and host. This reflects the semi-spontaneous, dialogic character of TR data, which requires conversants to clarify meaning between themselves. Such repetitions can be classed as involvement markers under Chafe's taxonomy, in that they help to monitor information flow between conversants.

#### 6.2.1.2 PMI Trigrams

In [356]:
corpus_trigram_measures = nltk.collocations.TrigramAssocMeasures()
trigram_finder = TrigramCollocationFinder.from_words(filtered_corpus_tokens)
trigram_finder.apply_freq_filter(3)
pmi_corpus_trigrams_top20 = trigram_finder.nbest(corpus_trigram_measures.pmi, 20)

print("Corpus trigrams: 20 most PMI-collocated (n>2)\n")
for idx, trigram in enumerate(pmi_corpus_trigrams_top20):
  print(idx+1, " ".join(trigram))

Corpus trigrams: 20 most PMI-collocated (n>2)

1 gandalf king theoden
2 africa ruling coalition
3 bling bling bling
4 hm hm hm
5 treatment infantile paralysis
6 jung chang tells
7 formerly banned organisation
8 novel jung chang
9 morton bay fig
10 toomie james lance
11 james lance medicines
12 choke death ballpoint
13 death ballpoint pens
14 successful outcome pregnancy
15 novelist wrote tobacco
16 blue tassel fern
17 gay intersex transgender
18 developed treatment infantile
19 lance medicines research
20 cordyline red sensation


- Of these trigrams, (2), (5), (6), (7), (8) and (16) are composed of bigrams in the top 20 above, and so help to clarify the true meaning of the collocation. Meanwhile, (18) shares one word with the highly collocated bigram "infantile paralysis", which might suggest the words form the quadrigram "developed treatment [for] infantile paralysis" (we validate this hypothesis below).
- By comparing 2- and 3-gram collocations, we can observe the following patterns in PMI-collocated ngrams:
  - Medical phrases remain prominent. Under this heading we can include (5), (11), (12), (13), (14), (18) and (19). Because item (10) shares the bigram "james lance" with (11), we can infer that this phrase too belongs to the medical genre.
  - Names of plants remain prominent too, with the cases of (9), (16) and (20).
  - Certain phrases collocate highly with authors, as with (6) and (8).
- In distinction from the bigrams, we also have the following cases:
  - (1) is highly collocated because all three terms refer to characters from the same fictional universe.
  - (3) "bling bling bling" represents a phrase taken from African-American Vernacular English (AAVE), and has been described by Kibin (2024) as an "ideophone ... it is not onomatopoeia, because the act of jewelry shining does not make a sound ... The form 'bling-bling' is a case of reduplication". The additional 'bling' in the trigram may therefore be taken as added emphasis, accentuating the 'flaunting' function of the expression (we explore the phrase further below through Concordance Analysis (CA)).
  - (4) is distinct as it contains no words, but rather a <i>vocalization</i> that generally expresses contemplation or consideration. This phrase can probably be classed as another 'involvement feature', a way of monitoring the flow of information between conversants, here likely indicating something like "yes I'm following" or "that's interesting to consider" (again, we assess the plausibility of this hypothesis through Concordance Analysis below).
  - (15) is peculiar and requires some CA to assess properly (see below).
  - (17) introduces a new PMI-collocated topic, that of sexuality.

#### 6.2.1.3 Trigram Concordances

3. <i>bling bling bling</i>

In [357]:
bling_concords = list_concords(corpus_tokens, ['bling', 'bling', 'bling'], width=200)

5 concordance examples for ['bling', 'bling', 'bling']
1 :    got your favourite um shirt on today I like that one . The country and western one . Yeah bling bling bling bling bling bling bling . I met Beck in this one . You met be oh is that the famous Beck s
2 :   our favourite um shirt on today I like that one . The country and western one . Yeah bling bling bling bling bling bling bling . I met Beck in this one . You met be oh is that the famous Beck shirt .
3 :   vourite um shirt on today I like that one . The country and western one . Yeah bling bling bling bling bling bling bling . I met Beck in this one . You met be oh is that the famous Beck shirt . Yeah 
4 :   e um shirt on today I like that one . The country and western one . Yeah bling bling bling bling bling bling bling . I met Beck in this one . You met be oh is that the famous Beck shirt . Yeah . And 
5 :   hirt on today I like that one . The country and western one . Yeah bling bling bling bling bling bling bling . I me

The frequency of the <b>bling</b> trigram is clearly distorted by its sevenfold repetition in this one instance. Such anomalies are a feature of transcribed speech data, where there are no stylistic or editorial constraints on usage. It is also in particular a feature of TR data: whereas the speech of the host may be constrained by some professional norms, the speech of the guest is not. However, we do learn something about the word 'bling', namely its capacity to remain meaningful even when consecutively repeated many times over. This likely has something to do with its pragmatic function as a <i>flaunt</i> or <i>flex</i>; it does not primarily convey semantic content, but signals an attitude or disposition. We can also observe here the use of palilogia to 'fill dead air', functioning as filler to lubricate social interaction.

4. <i>hm hm hm</i>

In [358]:
hm_concords = list_concords(corpus_tokens, ['hm', 'hm', 'hm'], width=200)

5 concordance examples for ['hm', 'hm', 'hm']
1 :    to get better . Y'know the other thing is if you do n't go to the doctor you do n't get sick . Hm hm hm hm hm It 's exactly . Barbara hello . Um good morning Neil and good morning Dr Cockburn . Um lo
2 :    get better . Y'know the other thing is if you do n't go to the doctor you do n't get sick . Hm hm hm hm hm It 's exactly . Barbara hello . Um good morning Neil and good morning Dr Cockburn . Um look 
3 :   t better . Y'know the other thing is if you do n't go to the doctor you do n't get sick . Hm hm hm hm hm It 's exactly . Barbara hello . Um good morning Neil and good morning Dr Cockburn . Um look I w


As with (3), the context reveals that the PMI here derives from a single case of consecutive repetition, fivefold in this instance. Here we see another feature of TR data: the preservation of non-lexical vocalizations.
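The counts in both cases follow from simple arithmetic: a run of k identical tokens contains k - 2 overlapping trigrams, so a single sevenfold 'bling' yields five trigram hits and a single fivefold 'hm' yields three, enough in each case to pass a frequency filter of 3. A minimal sketch using toy token lists modelled on the two concordances (the helper `count_ngrams` is hypothetical, not part of the notebook):

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count overlapping n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy utterances modelled on the two concordances (not the corpus itself).
bling_run = ["yeah"] + ["bling"] * 7 + ["."]
hm_run = ["sick", "."] + ["hm"] * 5 + ["it"]

bling_hits = count_ngrams(bling_run, 3)[("bling", "bling", "bling")]
hm_hits = count_ngrams(hm_run, 3)[("hm", "hm", "hm")]

# A run of k repeated tokens contributes k - 2 trigram occurrences.
print(bling_hits, hm_hits)
```

One possible mitigation, if such runs are judged to be noise, would be to collapse consecutive repeats before counting; whether that is appropriate depends on whether the repetition itself is the object of study.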

15. <i>novelist wrote tobacco</i>

In [359]:
tobacco_concords = list_concords(corpus_tokens, ['novelist', 'wrote', 'tobacco'], width=200)

5 concordance examples for ['novelist', 'wrote', 'tobacco']
1 :   do n't think I could remember the author . You may do better . Twenty-one which American novelist wrote Tobacco Road . My grandma had a copy of it it was very dog eared . Which American novelist wrote
2 :    wrote Tobacco Road . My grandma had a copy of it it was very dog eared . Which American novelist wrote Tobacco Road . Um look I 'm sure it 's wrong but I 'll just suggest John Steinbeck . Mm no it 's


In [360]:
tobacco = df_clean[df_clean['transcript'].str.contains("novelist wrote Tobacco")]
print(tobacco.filename)

16    NAT4
Name: filename, dtype: object


The PMI for this collocation appears to be a consequence of its occurrence in a quiz segment. Radio quizzes seem to present a case where highly marked (unusual) collocations are repeated multiple times as the host repeats the question and/or the caller ponders it. The PMI score here may be a fluke of the data: a larger corpus might contain more instances of the word 'tobacco', reducing the PMI of 'novelist wrote tobacco'.
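The sensitivity described above can be sketched with toy numbers. The function below uses the usual extension of PMI to three words (joint probability divided by the product of the three unigram probabilities); `trigram_pmi` and every count are assumptions for illustration, and the corpus size is held fixed for simplicity even though a genuinely larger corpus would change it too.

```python
import math

def trigram_pmi(tri_count, unigram_counts, total_tokens):
    """PMI for three words: log2( P(x, y, z) / (P(x) * P(y) * P(z)) )."""
    p_xyz = tri_count / total_tokens
    p_independent = math.prod(c / total_tokens for c in unigram_counts)
    return math.log2(p_xyz / p_independent)

total = 100_000  # hypothetical corpus size, held fixed for simplicity

# Case 1: 'tobacco' occurs only inside the repeated quiz question (3 times).
rare = trigram_pmi(3, unigram_counts=(10, 20, 3), total_tokens=total)

# Case 2: 'tobacco' also occurs 27 more times elsewhere in the corpus.
common = trigram_pmi(3, unigram_counts=(10, 20, 30), total_tokens=total)

# Drop in PMI caused purely by the extra independent occurrences.
print(round(rare - common, 2))
```

With the trigram count unchanged, a tenfold increase in one word's independent frequency lowers the score by log2(10), which illustrates why a collocation confined to one quiz question can rank so highly in a small corpus.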

# Conclusion

Our analysis corroborates Bednarek's (2014) finding that "talkback radio is a highly 'involved' form of spoken mass media discourse, despite its regulated, mediated and institutionalized nature" (19). We find frequent occurrences of terms and phrases which:

- Monitor information flow between participants.
- Make reference to speakers' mental processes.
- Function as hedges.

We have also addressed some of the difficulties in categorizing certain phrases, which can have multiple pragmatic functions.
<br><br>
We have addressed some of the challenges in analyzing Talkback Radio data, such as:

 - Frequent occurrences of disfluencies, contractions and transcription errors or ambiguities.
 - Minimal stylistic constraints such as might be imposed on more formal speech, which results in highly marked speech patterns.

On the other hand, TR data affords rich opportunities for Natural Language Analysis as a form of speech intermediate between scripted and spontaneous dialogue.
<br><br>
There remain many avenues of analysis left untouched here, which would no doubt be worth pursuing. These include, but are not limited to:

- Comparative analysis of speech patterns in commercial versus public broadcasts.
- Comparative analysis of transcripts by region, and comparison of national with regional broadcasts.
- Analysis of conversational dynamics, for example a comparison of host-caller interactions versus interactions with guests and co-hosts; comparative analyses by types of guests.
- Time-series analysis of TR data based on factors like time-of-day, day-of-week, month and year.
- Comparative analysis of Australian TR with TR data from other English-speaking countries.

# Bibliography

- Bednarek, M. (2014). 'Involvement in Australian Talkback Radio--A Corpus Linguistic Investigation'. <i>Australian Journal of Linguistics</i>, 34:1, 4-23. DOI: https://doi.org/10.1080/07268602.2014.875453.
<br>
- Kibin. (2024). <i>An analysis of the term bling</i>. http://www.kibin.com/essay-examples/an-analysis-of-the-term-bling-S8owGCIZ.
<br>
- Peters, P. (2009). 'The mandative subjunctive in spoken English'. <i>Comparative studies in Australian and New Zealand English grammar and beyond</i> (1st ed.). Eds. Peters, P., Collins, P., & Smith, A. John Benjamins Pub. Co.
<br>
- Pinker, S. (2014). <i>The sense of style: the thinking person's guide to writing in the 21st century.</i> Viking.
