# RNZ Climate Corpus Analysis

## Introduction

This notebook analyses the RNZ Climate corpus, comparing **national** (domestic New Zealand) and **international** (global) climate coverage from Radio New Zealand. The analysis applies corpus linguistics methods to examine differences in climate change discourse between domestic and international reporting.

This comparative approach uses the national coverage as the target corpus and international coverage as the reference corpus, revealing what is distinctive about New Zealand's domestic climate discourse.

In [29]:
import polars as pl
from pathlib import Path
from collections import Counter
import time
import math

In [30]:
DATA_PATH = Path('D:/github/DIGI405/data_raw')

RESULTS_DIR = Path('../results')
FIGS_DIR    = Path('../figs')

RESULTS_DIR.mkdir(exist_ok=True)
FIGS_DIR.mkdir(exist_ok=True)

In [43]:
# Load the CSV files (already split into national and international)
national_df = pl.read_csv(DATA_PATH / 'rnz_climate_national.csv.gz')
international_df = pl.read_csv(DATA_PATH / 'rnz_climate_international.csv.gz')

# Tokenize fulltext with a simple regex tokenizer (counting happens below via Counter)
import re

def tokenize_and_count(text):
    """Lowercase the text and return its word tokens as a list.

    Despite the name, no counting happens here; counts are built with Counter below.
    """
    if text is None or text == '':
        return []
    tokens = re.findall(r'\b\w+\b', text.lower())
    return tokens

# Process national corpus
national_tokens = []
for row in national_df.select('fulltext').to_dicts():
    if row['fulltext']:
        national_tokens.extend(tokenize_and_count(row['fulltext']))

national_token_counts = Counter(national_tokens)

# Process international corpus
international_tokens = []
for row in international_df.select('fulltext').to_dicts():
    if row['fulltext']:
        international_tokens.extend(tokenize_and_count(row['fulltext']))

international_token_counts = Counter(international_tokens)

# Create frequency DataFrames
vocab_national = pl.DataFrame({
    'token': list(national_token_counts.keys()),
    'frequency': list(national_token_counts.values())
}).sort('frequency', descending=True)

vocab_international = pl.DataFrame({
    'token': list(international_token_counts.keys()),
    'frequency': list(international_token_counts.values())
}).sort('frequency', descending=True)

total_national = sum(national_token_counts.values())
total_international = sum(international_token_counts.values())

print(f"National: {total_national:,} tokens, {len(national_token_counts):,} types")
print(f"International: {total_international:,} tokens, {len(international_token_counts):,} types")

National: 3,515,920 tokens, 46,840 types
International: 2,148,137 tokens, 43,578 types
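
To make the tokenizer's behaviour concrete, here is a small sketch that applies the same regular expression to an invented sentence (the sample text is illustrative and not drawn from the corpus). Apostrophes and hyphens act as word boundaries, so possessives and hyphenated compounds are split into separate tokens, while numbers are kept.

```python
import re

# Invented example sentence, not taken from the RNZ corpus
sample = "New Zealand's net-zero target: 2050."

# Same pattern as tokenize_and_count() above
tokens = re.findall(r'\b\w+\b', sample.lower())
print(tokens)
```

One consequence is that fragments such as a stray `s` can be expected to appear as types in the frequency lists.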


In [32]:
def log_likelihood(a, b, c, d):
    """Calculate the log-likelihood (G2) keyness statistic for one term.

    a = frequency in corpus 1, b = frequency in corpus 2
    c = total tokens in corpus 1, d = total tokens in corpus 2
    Only the two occurrence cells (a and b) contribute to the sum."""
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    
    if a == 0:
        g1 = 0
    else:
        g1 = a * math.log(a / e1)
    
    if b == 0:
        g2 = 0
    else:
        g2 = b * math.log(b / e2)
    
    return 2 * (g1 + g2)
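
The function above sums over the two occurrence cells only, a common simplification in corpus keyness work. As a hedged sketch, the block below re-implements it next to a variant that also includes the two non-occurrence cells, so the size of the difference can be checked. The names `ll_two_term` and `ll_four_cell` are illustrative and not part of the notebook; the counts for 'emissions' are taken from the output printed later in this notebook.

```python
import math

def ll_two_term(a, b, c, d):
    """Two-term log-likelihood, mirroring log_likelihood() above."""
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g1 = a * math.log(a / e1) if a > 0 else 0.0
    g2 = b * math.log(b / e2) if b > 0 else 0.0
    return 2 * (g1 + g2)

def ll_four_cell(a, b, c, d):
    """G2 over all four cells: occurrences and non-occurrences in each corpus."""
    n = c + d
    observed = [a, b, c - a, d - b]
    expected = [
        c * (a + b) / n,
        d * (a + b) / n,
        c * (n - a - b) / n,
        d * (n - a - b) / n,
    ]
    # Cells with an observed count of zero contribute nothing to the sum
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

# Counts for 'emissions' and corpus totals as reported in this notebook
a, b = 7_976, 2_235
c, d = 3_515_920, 2_148_137

two_term = ll_two_term(a, b, c, d)
four_cell = ll_four_cell(a, b, c, d)
print(f"two-term:  {two_term:.1f}")
print(f"four-cell: {four_cell:.1f}")
```

For corpora of this size the two versions should differ only slightly, because the non-occurrence cells sit very close to their expected values.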

## Comparative Frequency Analysis

The following cells take the 50 most frequent tokens in national climate coverage and compare their frequencies with international coverage. This comparison shows which terms are used more frequently in domestic New Zealand climate reporting than in international climate news.

The analysis includes the following statistical measures:
- **Normalized Frequency**: Frequency per million tokens, which allows direct comparison between corpora of different sizes
- **Relative Risk**: Ratio of the national to the international normalized frequency (>1 = more frequent in national coverage)
- **Log Ratio**: Base-2 logarithm of the Relative Risk, an effect size measure in which each unit corresponds to a doubling
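
As a worked illustration of these three measures, the sketch below recomputes them for the token 'emissions' from the raw counts reported in the results later in this notebook. The variable names are illustrative and the block is independent of the notebook's own pipeline.

```python
import math

# Raw counts for 'emissions' and corpus sizes, as reported in this notebook's output
freq_nat, freq_int = 7_976, 2_235
total_national, total_international = 3_515_920, 2_148_137

# Normalized frequency: occurrences per million tokens
norm_nat = freq_nat / total_national * 1_000_000
norm_int = freq_int / total_international * 1_000_000

# Relative risk: ratio of the two normalized frequencies
relative_risk = norm_nat / norm_int

# Log ratio: base-2 logarithm of the relative risk
log_ratio = math.log2(relative_risk)

print(f"norm_nat={norm_nat:.1f} norm_int={norm_int:.1f} "
      f"relative_risk={relative_risk:.2f} log_ratio={log_ratio:.2f}")
```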

In [60]:
output_file = RESULTS_DIR / 'frequency_analysis.txt'

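# how='full' replaces the deprecated how='outer'; coalesce=True keeps a single 'token' column
comparison = vocab_national.join(vocab_international, on='token', how='full', coalesce=True, suffix='_int').fill_null(0)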
comparison = comparison.with_columns([
    ((pl.col('frequency') / total_national) * 1000000).alias('norm_freq_nat'),
    ((pl.col('frequency_int') / total_international) * 1000000).alias('norm_freq_int')
])

comparison = comparison.with_columns([
    (pl.col('norm_freq_nat') / pl.col('norm_freq_int')).alias('relative_risk'),
    (pl.col('norm_freq_nat') / pl.col('norm_freq_int')).log(2).alias('log_ratio')
])

comparison = comparison.sort('frequency', descending=True).head(50)

with open(output_file, 'w', encoding='utf-8') as f:
    f.write("COMPARATIVE FREQUENCY ANALYSIS\n")
    f.write("Target: RNZ National Climate Coverage\n")
    f.write("Reference: RNZ International Climate Coverage\n\n")
    
    f.write(f"{'Rank':<5} {'Token':<12} {'Nat Freq':>9} {'Int Freq':>9} {'Norm Nat':>9} {'Norm Int':>9} {'RelRisk':>7} {'LogRat':>7}\n")
    f.write(f"{'':<5} {'':<12} {'':<9} {'':<9} {'(per M)':>9} {'(per M)':>9} {'':<7} {'':<7}\n")
    f.write("-" * 78 + "\n")
    
    for i, row in enumerate(comparison.iter_rows(named=True), 1):
        token = row['token']
        freq_nat = row['frequency']
        freq_int = row['frequency_int']
        norm_nat = row['norm_freq_nat']
        norm_int = row['norm_freq_int']
        rel_risk = row['relative_risk']
        log_ratio = row['log_ratio']
        
        # Truncate long tokens
        token_display = token[:12]
        f.write(f"{i:<5} {token_display:<12} {freq_nat:>9,} {freq_int:>9,} {norm_nat:>9.1f} {norm_int:>9.1f}")
        
        # math.isfinite() is already False for NaN and for infinities
        if math.isfinite(rel_risk):
            f.write(f" {rel_risk:>7.2f}")
        else:
            f.write(f" {' --':>7}")
        
        if math.isfinite(log_ratio):
            f.write(f" {log_ratio:>7.2f}\n")
        else:
            f.write(f" {' --':>7}\n")
    
    f.write(f"\nTotal tokens - National: {total_national:,}\n")
    f.write(f"Total tokens - International: {total_international:,}\n")



## Climate-Related Vocabulary Comparison

This section examines specific climate-related terms to understand how their usage differs between national and international coverage. The Log Likelihood statistic indicates whether a difference in frequency is statistically significant, using critical values of the chi-square distribution with one degree of freedom:

- Terms with Log Likelihood > 3.84 show statistically significant differences (p < 0.05).
- Terms with Log Likelihood > 10.83 are highly significant (p < 0.001).
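
The thresholds above can be checked directly. As a minimal sketch, assuming the statistic is referred to a chi-square distribution with one degree of freedom, the upper-tail probability has the closed form `erfc(sqrt(LL / 2))`, which needs only the standard library. The helper name `ll_p_value` is illustrative and not part of the notebook.

```python
import math

def ll_p_value(ll):
    """Upper-tail p-value of a chi-square variable with 1 degree of freedom."""
    # Guard against tiny negative values caused by floating-point rounding
    return math.erfc(math.sqrt(max(ll, 0.0) / 2))

for threshold in (3.84, 10.83):
    print(f"LL = {threshold:>5}: p = {ll_p_value(threshold):.4f}")
```

Applying the same helper to a term's Log Likelihood gives its p-value, which can be more informative than a pass or fail against a fixed threshold.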

In [57]:
output_file = RESULTS_DIR / 'climate_terms.txt'

climate_terms = [
    'climate', 'carbon', 'emissions', 'warming', 'greenhouse',
    'temperature', 'fossil', 'renewable', 'sustainability', 'pollution',
    'biodiversity', 'extinction', 'ice', 'drought'
]

results = []
for term in climate_terms:
    nat_match = vocab_national.filter(pl.col('token').str.to_lowercase() == term.lower())
    int_match = vocab_international.filter(pl.col('token').str.to_lowercase() == term.lower())
    
    freq_nat = nat_match['frequency'].sum() if nat_match.shape[0] > 0 else 0
    freq_int = int_match['frequency'].sum() if int_match.shape[0] > 0 else 0
    
    norm_nat = (freq_nat / total_national) * 1000000
    norm_int = (freq_int / total_international) * 1000000
    
    # Log-likelihood is defined even when one of the two frequencies is zero
    ll = log_likelihood(freq_nat, freq_int, total_national, total_international)
    
    if freq_int > 0:
        rel_risk = norm_nat / norm_int
        log_ratio = math.log2(norm_nat / norm_int) if norm_nat > 0 else float('-inf')
    else:
        rel_risk = float('inf') if freq_nat > 0 else float('nan')
        log_ratio = float('inf') if freq_nat > 0 else float('nan')
    
    results.append({
        'term': term,
        'freq_nat': freq_nat,
        'freq_int': freq_int,
        'norm_nat': norm_nat,
        'norm_int': norm_int,
        'rel_risk': rel_risk,
        'log_ratio': log_ratio,
        'log_likelihood': ll
    })

with open(output_file, 'w', encoding='utf-8') as f:
    f.write("CLIMATE-RELATED TERMS - COMPARATIVE ANALYSIS\n")
    f.write("Target: RNZ National Climate Coverage\n")
    f.write("Reference: RNZ International Climate Coverage\n\n")
    
    f.write(f"{'Term':<14} {'Nat Freq':>9} {'Int Freq':>9} {'Norm Nat':>9} {'Norm Int':>9} {'RelRisk':>7} {'LogRat':>7} {'LogLik':>7}\n")
    f.write(f"{'':<14} {'':<9} {'':<9} {'(per M)':>9} {'(per M)':>9} {'':<7} {'':<7} {'':<7}\n")
    f.write("-" * 82 + "\n")
    
    for r in sorted(results, key=lambda x: x['freq_nat'], reverse=True):
        term_display = r['term'][:14]
        f.write(f"{term_display:<14} {r['freq_nat']:>9,} {r['freq_int']:>9,} {r['norm_nat']:>9.1f} {r['norm_int']:>9.1f}")
        
        # math.isfinite() is already False for NaN and for infinities
        if math.isfinite(r['rel_risk']):
            f.write(f" {r['rel_risk']:>7.2f}")
        else:
            f.write(f" {' --':>7}")
        
        if math.isfinite(r['log_ratio']):
            f.write(f" {r['log_ratio']:>7.2f}")
        else:
            f.write(f" {' --':>7}")
        
        if math.isfinite(r['log_likelihood']):
            f.write(f" {r['log_likelihood']:>7.1f}\n")
        else:
            f.write(f" {' --':>7}\n")
    
    f.write(f"\nTotal tokens - National: {total_national:,}\n")
    f.write(f"Total tokens - International: {total_international:,}\n")
    f.write(f"\nInterpretation:\n")
    f.write(f"- Relative Risk > 1: term more frequent in national coverage\n")
    f.write(f"- Relative Risk < 1: term more frequent in international coverage\n")
    f.write(f"- Log Likelihood > 3.84: statistically significant (p < 0.05)\n")
    f.write(f"- Log Likelihood > 10.83: highly significant (p < 0.001)\n")

## Corpus Size Comparison

This section compares the size and composition of the national and international coverage subcorpora.

In [None]:
output_file = RESULTS_DIR / 'document_analysis.txt'

with open(output_file, 'w', encoding='utf-8') as f:
    f.write("CORPUS SIZE COMPARISON\n")
    f.write("National vs International Coverage\n\n")
    
    nat_docs = national_df.select('id').n_unique()
    int_docs = international_df.select('id').n_unique()
    
    f.write(f"National documents:      {nat_docs:>8,}\n")
    f.write(f"International documents: {int_docs:>8,}\n")
    f.write(f"Total documents:         {nat_docs + int_docs:>8,}\n\n")
    
    f.write(f"National tokens:         {total_national:>12,}\n")
    f.write(f"International tokens:    {total_international:>12,}\n")
    f.write(f"Total tokens:            {total_national + total_international:>12,}\n\n")
    
    nat_types = vocab_national.shape[0]
    int_types = vocab_international.shape[0]
    
    f.write(f"National vocabulary:     {nat_types:>8,} types\n")
    f.write(f"International vocabulary:{int_types:>8,} types\n")

## Results

The following sections present the output from the analysis above.

In [61]:
output_file = RESULTS_DIR / 'frequency_analysis.txt'
with open(output_file, 'r', encoding='utf-8') as f:
    print(f.read())

COMPARATIVE FREQUENCY ANALYSIS
Target: RNZ National Climate Coverage
Reference: RNZ International Climate Coverage

Rank  Token         Nat Freq  Int Freq  Norm Nat  Norm Int RelRisk  LogRat
                                         (per M)   (per M)                
------------------------------------------------------------------------------
1     the            202,077   139,879   57474.9   65116.4    0.88   -0.18
2     to             109,099    65,821   31030.0   30641.0    1.01    0.02
3     and             94,414    55,886   26853.3   26016.0    1.03    0.05
4     of              83,449    58,253   23734.6   27117.9    0.88   -0.19
5     a               68,984    40,319   19620.5   18769.3    1.05    0.06
6     in              61,236    44,897   17416.8   20900.4    0.83   -0.26
7     that            39,785    21,599   11315.7   10054.8    1.13    0.17
8     it              37,866    15,619   10769.9    7271.0    1.48    0.57
9     for             37,658    21,674   10710.7   1008

In [59]:
output_file = RESULTS_DIR / 'climate_terms.txt'
with open(output_file, 'r', encoding='utf-8') as f:
    print(f.read())

CLIMATE-RELATED TERMS - COMPARATIVE ANALYSIS
Target: RNZ National Climate Coverage
Reference: RNZ International Climate Coverage

Term            Nat Freq  Int Freq  Norm Nat  Norm Int RelRisk  LogRat  LogLik
                                     (per M)   (per M)                        
----------------------------------------------------------------------------------
climate           15,534    13,963    4418.2    6500.1    0.68   -0.56  1082.0
emissions          7,976     2,235    2268.5    1040.4    2.18    1.12  1208.8
carbon             3,942     1,222    1121.2     568.9    1.97    0.98   477.7
greenhouse         1,729       725     491.8     337.5    1.46    0.54    75.8
fossil             1,093       890     310.9     414.3    0.75   -0.41    39.9
warming              881     1,081     250.6     503.2    0.50   -1.01   236.8
renewable            783       412     222.7     191.8    1.16    0.22     6.1
temperature          616       652     175.2     303.5    0.58   -0.79    94

In [51]:
output_file = RESULTS_DIR / 'document_analysis.txt'
with open(output_file, 'r', encoding='utf-8') as f:
    print(f.read())

CORPUS SIZE COMPARISON
National vs International Coverage

National documents:         6,059
International documents:    4,928
Total documents:           10,987

National tokens:            3,515,920
International tokens:       2,148,137
Total tokens:               5,664,057

National vocabulary:       46,840 types
International vocabulary:  43,578 types
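
The figures above can be combined into a rough per-document measure. The sketch below derives mean tokens per document for each subcorpus from the reported totals, assuming each `id` corresponds to one document; the variable names are illustrative.

```python
# Document and token counts as reported in the corpus size comparison above
nat_docs, int_docs = 6_059, 4_928
total_national, total_international = 3_515_920, 2_148_137

nat_mean = total_national / nat_docs
int_mean = total_international / int_docs

print(f"National:      {nat_mean:.1f} tokens per document")
print(f"International: {int_mean:.1f} tokens per document")
```

On these figures, national items are longer on average (about 580 tokens versus about 436).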



## Summary of Key Findings

### Comparative Analysis Overview

This analysis compares **RNZ National** (domestic New Zealand) climate coverage with **RNZ International** (global) climate coverage to identify what is distinctive about New Zealand's domestic climate discourse.

### Statistical Measures Used

- **Normalized Frequency**: Occurrences per million tokens (allows direct comparison between corpora of different sizes)
- **Relative Risk**: Ratio of normalized frequencies (>1 indicates higher use in national coverage, <1 indicates higher use in international coverage)
- **Log Ratio**: Base-2 logarithm of the Relative Risk; an effect size measure showing the magnitude of the difference
- **Log Likelihood**: Statistical significance test (>3.84 = p<0.05; >10.83 = p<0.001)

### Climate Terms Analysis Highlights

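The comparative analysis shows which climate-related terms are emphasized differently in national versus international coverage. Among the terms visible in the output above, 'emissions' (Relative Risk 2.18), 'carbon' (1.97) and 'greenhouse' (1.46) are over-represented in New Zealand's domestic climate discourse, while 'climate' (0.68), 'fossil' (0.75), 'temperature' (0.58) and 'warming' (0.50) are more prominent in international climate reporting. 'Renewable' (1.16) shows only a small difference.

Log Likelihood indicates whether an observed difference is unlikely to have arisen by chance. It does not measure the size of the difference, so it should be read alongside Relative Risk and Log Ratio. For example, 'renewable' has a Log Likelihood of 6.1, which clears the p < 0.05 threshold but not the p < 0.001 threshold.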

### Interpretation Notes

- **Relative Risk = 2.0**: Term appears twice as frequently (per million tokens) in national coverage
- **Relative Risk = 0.5**: Term appears half as frequently in national coverage (more common internationally)
- **Log Ratio = 1.0**: Term is twice as frequent in national coverage (each unit of Log Ratio corresponds to a doubling; -1.0 means half as frequent)
- **Log Likelihood > 10.83**: Highly significant difference (p < 0.001)
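
Because Log Ratio is a base-2 logarithm, it converts back to a fold difference as `2 ** log_ratio`. The short sketch below does this for three Log Ratio values printed in the climate terms output above; the dictionary and variable names are illustrative.

```python
# Log Ratio values as printed in the climate terms output above
log_ratios = {'emissions': 1.12, 'carbon': 0.98, 'warming': -1.01}

# Convert each log ratio back to a fold difference (national relative to international)
folds = {term: 2 ** lr for term, lr in log_ratios.items()}

for term, fold in folds.items():
    print(f"{term:<10} {fold:.2f}x")
```

The recovered values should agree with the Relative Risk column up to rounding.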

## Further Analysis

Additional corpus linguistics methods that could be applied to this comparative analysis include:

- **Concordancing (KWIC)**: Examine how climate terms are used in context in each corpus
- **Collocations**: Compare which words co-occur with climate terms in national vs international coverage
- **N-grams**: Identify characteristic multi-word phrases in each corpus
- **Keyword Analysis**: Statistical identification of the most distinctive terms in each corpus
- **Dispersion**: Measure how evenly climate terms are spread across the documents of each corpus, and, with documents grouped by date, how usage changes over time
- **Semantic Analysis**: Examine whether the same terms are used in different semantic contexts
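
As an illustration of the first method in this list, here is a minimal keyword-in-context (KWIC) sketch using only the standard library. It reuses the notebook's regex tokenizer; the function name `kwic`, the `window` parameter, and the sample sentence are all hypothetical and not drawn from the corpus.

```python
import re

def kwic(text, keyword, window=5):
    """Return (left context, keyword, right context) for each hit of keyword."""
    tokens = re.findall(r'\b\w+\b', text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = ' '.join(tokens[max(0, i - window):i])
            right = ' '.join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

# Invented sentence for demonstration only
sample = "The government says emissions will fall, but agricultural emissions remain high."

hits = kwic(sample, 'emissions', window=3)
for left, node, right in hits:
    print(f"{left:>30} | {node} | {right}")
```

Running the same function over the `fulltext` column of each subcorpus would give side-by-side concordances for a term, though a dedicated concordancing tool is likely a better fit at corpus scale.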

These methods would help build a more complete picture of how climate change discourse differs between New Zealand's domestic and international reporting.