
# RNZ Climate Corpus Analysis

## Introduction

The assignment brief is not yet available, so the scope of this notebook is my best guess. The notebook analyses the RNZ Climate corpus using the Conc library, comparing **national** (domestic New Zealand) and **international** (global) climate coverage from Radio New Zealand. It applies corpus linguistics methods to examine how climate change discourse differs between domestic and international reporting.

The approach uses national coverage as the target corpus and international coverage as the reference corpus, which highlights what is distinctive about New Zealand's domestic climate discourse. Without instructions to work from, the aim is to familiarise myself with the Conc library and its application to corpus analysis.

In [35]:
from conc.corpus import Corpus
from conc.conc import Conc
from pathlib import Path

In [36]:
SERVER_PATH = '/srv/corpora/'
LOCAL_PATH = 'D:/github/DIGI405/corpora/'
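# Prefer the local copy when it exists; otherwise fall back to the server path
BASE_PATH = Path(LOCAL_PATH) if Path(LOCAL_PATH).exists() else Path(SERVER_PATH)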

RESULTS_DIR = Path('../results')
FIGS_DIR = Path('../figs')

RESULTS_DIR.mkdir(exist_ok=True)
FIGS_DIR.mkdir(exist_ok=True)

In [37]:
# Load the pre-built corpora
try:
    national_corpus = Corpus().load(BASE_PATH / 'rnz-climate-national.corpus')
    print("NATIONAL CORPUS:")
    national_corpus.summary()
    national_conc = Conc(national_corpus)
    has_national = True
except Exception as e:
    print(f"Could not load national corpus: {e}")
    has_national = False

print("\n" + "="*80 + "\n")

try:
    international_corpus = Corpus().load(BASE_PATH / 'rnz-climate-international.corpus')
    print("INTERNATIONAL CORPUS:")
    international_corpus.summary()
    international_conc = Conc(international_corpus)
    has_international = True
except Exception as e:
    print(f"Could not load international corpus: {e}")
    print("Run: D:\\Python\\python.exe scripts/build_rnz_corpora.py to build missing corpora")
    has_international = False

if has_national and has_international:
    print("\nBoth corpora loaded successfully - ready for comparative analysis!")
elif has_national:
    print("\nOnly national corpus loaded - limited analysis available")
else:
    print("\nNo corpora loaded - cannot proceed with analysis")

Could not load national corpus: Path 'D:\github\DIGI405\corpora\rnz-climate-national.corpus' does not contain all expected files: ['corpus.json', 'vocab.parquet', 'tokens.parquet', 'puncts.parquet', 'spaces.parquet']


Could not load international corpus: Path 'D:\github\DIGI405\corpora\rnz-climate-international.corpus' is not a directory
Run: D:\Python\python.exe scripts/build_rnz_corpora.py to build missing corpora

No corpora loaded - cannot proceed with analysis
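
The load errors above point to corpus directories that are missing or incomplete. As a minimal sketch, the check below reports which expected files are absent from a corpus directory. The file list is copied from the error message above and may differ between Conc versions; `missing_corpus_files` is a hypothetical helper, not part of Conc, and the temporary directory is only a stand-in for a real corpus path.

```python
from pathlib import Path
import tempfile

# File names taken from the error message above (assumption: they may
# vary between Conc versions).
EXPECTED_FILES = ['corpus.json', 'vocab.parquet', 'tokens.parquet',
                  'puncts.parquet', 'spaces.parquet']

def missing_corpus_files(corpus_path):
    """Return the expected corpus files that are absent from corpus_path."""
    corpus_path = Path(corpus_path)
    if not corpus_path.is_dir():
        # A path that is not a directory is missing every expected file.
        return list(EXPECTED_FILES)
    return [name for name in EXPECTED_FILES
            if not (corpus_path / name).is_file()]

# Demonstration on a throwaway directory holding only corpus.json.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / 'demo.corpus'
    demo.mkdir()
    (demo / 'corpus.json').write_text('{}')
    missing = missing_corpus_files(demo)
    print(missing)
```

Running the same check on `BASE_PATH / 'rnz-climate-national.corpus'` before calling `Corpus().load(...)` would show which files still need to be built.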


## Keywords Analysis

Keywords analysis identifies terms that are statistically distinctive in the national corpus compared to international coverage. This reveals what is characteristic of New Zealand's domestic climate discourse.

### Statistical Measures Interpretation

#### Log Likelihood Ratio (LLR)
| LLR Value | P-Value | Interpretation |
|-----------|---------|----------------|
| > 3.84    | p < 0.05   | Statistically significant |
| > 6.63    | p < 0.01   | Highly significant |
| > 10.83   | p < 0.001  | Very highly significant |
| > 15.13   | p < 0.0001 | Extremely significant |
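
To make the thresholds in this table concrete, here is a minimal sketch of the two-cell log-likelihood calculation commonly used for keyness, written in plain Python. It is not necessarily how Conc computes the statistic internally, and the frequencies in the example are hypothetical rather than taken from the RNZ corpora.

```python
import math

def log_likelihood(freq_target, freq_reference, size_target, size_reference):
    """Two-cell log-likelihood (G2) for one term across two corpora."""
    total = size_target + size_reference
    combined = freq_target + freq_reference
    expected_target = size_target * combined / total
    expected_reference = size_reference * combined / total
    g2 = 0.0
    for observed, expected in ((freq_target, expected_target),
                               (freq_reference, expected_reference)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2

def significance_label(g2):
    """Map a G2 value onto the thresholds listed in the table."""
    thresholds = [(15.13, 'p < 0.0001'), (10.83, 'p < 0.001'),
                  (6.63, 'p < 0.01'), (3.84, 'p < 0.05')]
    for cutoff, label in thresholds:
        if g2 > cutoff:
            return label
    return 'not significant'

# Hypothetical counts: 30 hits in a 1,000-token target corpus,
# 10 hits in a 1,000-token reference corpus.
g2 = log_likelihood(30, 10, 1000, 1000)
print(round(g2, 2), significance_label(g2))  # 10.46 p < 0.01
```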

#### Relative Risk
| Relative Risk | Interpretation |
|--------------|----------------|
| = 1          | Equal frequency in both corpora |
| > 1          | Over-represented in national corpus (characteristic of NZ discourse) |
| < 1          | Under-represented in national corpus (more common internationally) |
| > 2          | More than twice as frequent in national coverage |
| > 5          | Strong over-representation in national coverage |
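
As a rough illustration of the table, relative risk can be computed as the ratio of a term's relative frequency in the target corpus to its relative frequency in the reference corpus. The sketch below assumes that definition; the function name and the counts are hypothetical.

```python
def relative_risk(freq_target, freq_reference, size_target, size_reference):
    """Ratio of a term's relative frequency in the target vs the reference."""
    if freq_reference == 0:
        # The term never occurs in the reference corpus, so the ratio is unbounded.
        return float('inf')
    # Equivalent to (freq_target / size_target) / (freq_reference / size_reference),
    # rearranged to keep the arithmetic exact for integer counts.
    return (freq_target * size_reference) / (freq_reference * size_target)

# Hypothetical counts: 30 per 1,000 tokens nationally vs 10 per 1,000 internationally.
rr = relative_risk(30, 10, 1000, 1000)
print(rr)  # 3.0
```

A value of 3.0 falls in the "> 2" row: the term is three times as frequent in the national corpus, relative to corpus size.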

#### Log Ratio
| Log Ratio | Interpretation |
|-----------|----------------|
| ≈ 0       | Similar frequency in both corpora |
| > 1       | Noticeable over-use in national corpus |
| > 2       | Strong over-use in national corpus |
| > 3       | Very strong over-use in national corpus |
| < -1      | Noticeable under-use in national corpus |
| < -2      | Strong under-use in national corpus |

Note: Positive values indicate over-representation in national coverage; negative values indicate under-representation.
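
Log Ratio is the binary logarithm of the ratio of relative frequencies, so each whole point represents a doubling: 1 means twice as frequent, 2 means four times, 3 means eight times. A minimal sketch follows, assuming that zero counts are replaced by a small constant (0.5 here, one common convention and not necessarily what Conc does). The counts are hypothetical.

```python
import math

def log_ratio(freq_target, freq_reference, size_target, size_reference,
              zero_adjustment=0.5):
    """Binary log of the ratio of relative frequencies (target / reference)."""
    # Assumption: zero counts are replaced so the logarithm stays defined.
    ft = freq_target if freq_target > 0 else zero_adjustment
    fr = freq_reference if freq_reference > 0 else zero_adjustment
    return math.log2((ft * size_reference) / (fr * size_target))

# Hypothetical counts: four times as frequent in the national corpus.
over_use = log_ratio(40, 10, 1000, 1000)
under_use = log_ratio(10, 40, 1000, 1000)
print(over_use, under_use)  # 2.0 -2.0
```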

In [38]:
# Generate keywords analysis: national (target) vs international (reference)
if has_national and has_international:
    keywords_df = national_conc.keywords(
        reference_corpus=international_corpus,
        effect_size_measure='log_likelihood',
        statistical_significance_cut=3.84,
        min_frequency=5
    )
    
    keywords_df.display()
    
    # Save to file
    output_file = RESULTS_DIR / 'keywords_analysis.txt'
    keywords_df.save(output_file)
else:
    print("Keywords analysis requires both national and international corpora")
    print("Build rnz-climate-international.corpus to enable comparative analysis")

Keywords analysis requires both national and international corpora
Build rnz-climate-international.corpus to enable comparative analysis


## Collocations Analysis

Collocations reveal which words frequently co-occur with key climate terms. This analysis examines the linguistic context around "climate" to understand how the term is framed in national vs international coverage.

### Mutual Information (MI) Interpretation

| MI Score | Interpretation |
|----------|----------------|
| > 3      | Meaningful collocation |
| > 6      | Strong collocation |
| > 8      | Very strong collocation |

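Higher MI scores indicate stronger lexical associations between words. MI tends to favour rare collocates, which is why the cell below applies a minimum collocate frequency.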
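
For intuition, MI compares how often a node word and a collocate actually co-occur with how often they would co-occur by chance. The sketch below uses one common formulation, where the expected count is scaled by the window span; Conc's exact formula may differ, and all counts are hypothetical.

```python
import math

def mutual_information(cooccurrence, node_freq, collocate_freq,
                       corpus_size, span=10):
    """MI = log2(observed / expected) co-occurrence within a window.

    span is the total window width, e.g. 10 for 5 tokens either side
    (an assumption about how the window is counted).
    """
    if cooccurrence <= 0:
        raise ValueError('MI is undefined when the words never co-occur')
    expected = node_freq * collocate_freq * span / corpus_size
    return math.log2(cooccurrence / expected)

# Hypothetical counts: node appears 1,000 times, collocate 100 times,
# they co-occur 80 times in a corpus of 1,000,000 tokens.
mi = mutual_information(80, 1000, 100, 1_000_000, span=10)
print(round(mi, 2))  # 6.32
```

With these counts the pair would land in the "strong collocation" band of the table.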

In [39]:
if has_national and has_international:
    # Collocations for "climate" in national corpus
    print("NATIONAL CORPUS - Collocations for 'climate':")
    national_climate_colloc = national_conc.collocates(
        'climate',
        effect_size_measure='mutual_information',
        context_length=5,
        min_collocate_frequency=5
    )
    national_climate_colloc.display(20)
    
    print("\n" + "="*80 + "\n")
    
    # Collocations for "climate" in international corpus
    print("INTERNATIONAL CORPUS - Collocations for 'climate':")
    international_climate_colloc = international_conc.collocates(
        'climate',
        effect_size_measure='mutual_information',
        context_length=5,
        min_collocate_frequency=5
    )
    international_climate_colloc.display(20)
    
    # Save results
    national_climate_colloc.save(RESULTS_DIR / 'national_climate_collocates.txt')
    international_climate_colloc.save(RESULTS_DIR / 'international_climate_collocates.txt')
else:
    print("Collocations analysis requires both national and international corpora")

Collocations analysis requires both national and international corpora


## Concordance Analysis

Concordance displays use keyword-in-context (KWIC) lines to show how specific terms are used in actual articles. This qualitative analysis complements the statistical measures above.
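
To show what a KWIC line looks like independently of any library, here is a small plain-Python sketch. The `kwic` function is a hypothetical illustration, not Conc's implementation, and the sample text is invented rather than drawn from the RNZ corpora.

```python
import re

def kwic(text, keyword, window=30):
    """Return keyword-in-context lines with `window` characters either side."""
    pattern = r'\b' + re.escape(keyword) + r'\b'
    lines = []
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - window):match.start()]
        right = text[match.end():match.end() + window]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{window}} [{match.group()}] {right}")
    return lines

# Invented sample text for illustration only.
sample = ("Agricultural emissions rose last year. "
          "The government plans to cut emissions by 2030.")
kwic_lines = kwic(sample, 'emissions', window=25)
for line in kwic_lines:
    print(line)
```

Aligning the keyword in a single column is what makes recurring patterns on either side easy to scan.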

In [40]:
if has_national and has_international:
    # Concordance for "emissions" in national corpus
    print("NATIONAL CORPUS - Concordance for 'emissions':")
    national_emissions_conc = national_conc.concordance('emissions', window_size=50)
    national_emissions_conc.display(10)
    
    print("\n" + "="*80 + "\n")
    
    # Concordance for "emissions" in international corpus
    print("INTERNATIONAL CORPUS - Concordance for 'emissions':")
    international_emissions_conc = international_conc.concordance('emissions', window_size=50)
    international_emissions_conc.display(10)
    
    # Save results
    national_emissions_conc.save(RESULTS_DIR / 'national_emissions_concordance.txt')
    international_emissions_conc.save(RESULTS_DIR / 'international_emissions_concordance.txt')
else:
    print("Concordance analysis requires both national and international corpora")

Concordance analysis requires both national and international corpora


## Results Summary

The following sections display the saved analysis outputs.

In [None]:
# Display Keywords Analysis results
output_file = RESULTS_DIR / 'keywords_analysis.txt'
if output_file.exists():
    with open(output_file, 'r', encoding='utf-8') as f:
        print("KEYWORDS ANALYSIS:")
        print("="*80)
        print(f.read())
else:
    print("No keywords analysis output found.")
    print("The analysis requires both national and international corpora to run.")
    print("Build rnz-climate-international.corpus to generate results.")

IndentationError: unexpected indent (2828680093.py, line 11)

In [None]:
# Display National Climate Collocates
output_file = RESULTS_DIR / 'national_climate_collocates.txt'
if output_file.exists():
    with open(output_file, 'r', encoding='utf-8') as f:
        print("\nNATIONAL CORPUS - Climate Collocates:")
        print("="*80)
        print(f.read())
else:
    print("No national collocates output found.")
    print("The analysis requires both national and international corpora to run.")

In [None]:
# Display International Climate Collocates
output_file = RESULTS_DIR / 'international_climate_collocates.txt'
if output_file.exists():
    with open(output_file, 'r', encoding='utf-8') as f:
        print("\nINTERNATIONAL CORPUS - Climate Collocates:")
        print("="*80)
        print(f.read())
else:
    print("No international collocates output found.")
    print("The analysis requires both national and international corpora to run.")

## Summary of Key Findings

### Comparative Analysis Overview

This analysis uses the Conc library to compare **RNZ National** (domestic New Zealand) climate coverage with **RNZ International** (global) climate coverage, identifying what is distinctive about New Zealand's domestic climate discourse.

### Analysis Methods Applied

1. **Keywords Analysis**: Statistical identification of distinctive terms using Log Likelihood Ratio, Relative Risk, and Log Ratio measures
2. **Collocations**: Examining words that frequently co-occur with "climate" using Mutual Information scores
3. **Concordance (KWIC)**: Viewing actual usage examples in context to understand how terms are deployed

### Interpreting the Results

#### Keywords (Distinctive Terms)
- Terms with **high Relative Risk (>2)** are characteristic of New Zealand's domestic climate discourse
- Terms with **low Relative Risk (<0.5)** are more prominent in international climate reporting
- **Log Likelihood > 10.83** indicates that a difference is very highly significant (p < 0.001)

#### Collocations (Co-occurring Words)
- **MI Scores > 6** indicate strong lexical associations
- Compare national vs international collocates to see different framing patterns
- High-MI collocates reveal the semantic context around key climate terms

#### Concordance Examples
- KWIC lines show actual usage in news articles
- Reveals qualitative patterns in how climate topics are discussed
- Complements statistical measures with real discourse examples

### Research Questions Addressed (Attempted)

- What climate terms are distinctively associated with New Zealand domestic coverage?
- How is "climate" framed differently in national vs international contexts?
- What vocabulary choices characterize New Zealand's climate discourse?

## Future Work 

This analysis can be extended with:

- **Additional search terms**: Collocations and concordance for "emissions", "carbon", "renewable"
- **N-grams**: Multi-word phrases distinctive to each corpus
- **Temporal analysis**: How coverage patterns change over time
- **Dispersion analysis**: Distribution of key terms across documents
- **Semantic patterns**: Topic modeling or word embeddings to identify discourse themes

The frequency-based methods fit naturally within the Conc library's corpus linguistics workflow; topic modelling and word embeddings would likely need additional tools.
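
As a rough sketch of the n-gram idea, multi-word phrases can be counted with the standard library alone. The tokeniser here is deliberately naive (lowercased alphabetic strings), the function name is hypothetical, and the sample sentence is invented; a real analysis would use the corpus's own tokenisation.

```python
from collections import Counter
import re

def ngram_counts(text, n=2):
    """Count n-grams over a naive lowercase word tokenisation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # zip over n staggered copies of the token list yields each n-gram once.
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Invented sample sentence for illustration only.
bigrams = ngram_counts("Climate change policy and climate change action", n=2)
print(bigrams.most_common(1))  # [(('climate', 'change'), 2)]
```

Comparing the top n-grams from each corpus would be one way to surface phrases distinctive to national or international coverage.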