Zero-shot text classification of national AI and defense strategies

ajkeith/StrategyDocumentAnalysis

AI and Defense Strategy: Text Analysis

This Python project analyzes national AI and defense strategy documents using zero-shot text classification. The project focuses on Southeast Asia and nearby countries, specifically Australia, Indonesia, Malaysia, Singapore, Thailand, and Vietnam.

Getting Started

python main.py

Usage

import os
from textanalysis import analysis

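# Path to one policy document in the data/policies directory
path = os.path.join(os.getcwd(), 'data', 'policies', 'australia_defense.pdf')
# Extract the text from the PDF
temp = analysis.extract_pdfs(path)
# Classify the text; returns a dataframe and an interactive plot
df, fig = analysis.analyze_corpus(temp)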

analyze_corpus returns two objects: a dataframe of text chunks classified by topic and sentiment, and an interactive plot of topic and sentiment by text chunk.
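
To make the idea of a "text chunk" concrete, here is a minimal sketch of one way to split extracted text into chunks before classification. The helper name chunk_text and the fixed word-count rule are assumptions for illustration only; the project may chunk text differently (for example by sentence or by paragraph).

```python
def chunk_text(text: str, words_per_chunk: int = 100) -> list[str]:
    """Split text into consecutive chunks of at most `words_per_chunk` words.

    Hypothetical helper: the chunking rule is an assumption, not the project's.
    """
    if words_per_chunk < 1:
        raise ValueError("words_per_chunk must be at least 1")
    words = text.split()
    return [
        " ".join(words[i : i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


sample = "AI will support national defense and economic growth across the region"
chunks = chunk_text(sample, words_per_chunk=4)
# chunks == ["AI will support national", "defense and economic growth", "across the region"]
```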

Algorithm Details

This code uses the facebook/bart-large-mnli model from Hugging Face. This is a BART-large model fine-tuned on MultiNLI and is used here for zero-shot text classification.
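
As a rough illustration of zero-shot topic classification with this model, the sketch below uses the Hugging Face transformers pipeline API. The helper names load_zero_shot_classifier and top_topic, and the topic labels in the usage comment, are hypothetical and are not taken from this project's code.

```python
from typing import Callable


def load_zero_shot_classifier():
    """Load the zero-shot pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily; heavy dependency

    return pipeline("zero-shot-classification", model="facebook/bart-large-mnli")


def top_topic(text: str, topics: list[str], classifier: Callable) -> tuple[str, float]:
    """Return the highest-scoring topic label and its score for one text chunk."""
    result = classifier(text, candidate_labels=topics)
    # The pipeline returns labels and scores sorted from most to least likely.
    return result["labels"][0], result["scores"][0]


# Example usage (topic labels are illustrative assumptions):
# classifier = load_zero_shot_classifier()
# label, score = top_topic("We will modernize our naval forces.",
#                          ["defense", "economy", "ethics"], classifier)
```

Passing the classifier in as an argument keeps the model loaded once and reused across chunks.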

This code also uses the distilbert-base-uncased-finetuned-sst-2-english model from Hugging Face. This is a fine-tuned model based on DistilBERT and used here for sentiment classification.

distilbert-base-uncased-finetuned-sst-2-english has strong reported evaluation results in terms of accuracy and precision.

However, it is also subject to risks, limitations, and biases.
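
A minimal sketch of sentiment classification with this model, again using the transformers pipeline API. The helper names and the signed-score convention (positive above zero, negative below zero) are assumptions made for illustration; the project may represent sentiment differently.

```python
from typing import Callable


def load_sentiment_classifier():
    """Load the sentiment pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily; heavy dependency

    return pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )


def signed_sentiment(text: str, classifier: Callable) -> float:
    """Map the pipeline output to a score in [-1, 1].

    Hypothetical convention: POSITIVE gives +score, NEGATIVE gives -score.
    """
    prediction = classifier(text)[0]  # the pipeline returns one dict per input text
    sign = 1.0 if prediction["label"] == "POSITIVE" else -1.0
    return sign * prediction["score"]


# Example usage:
# classifier = load_sentiment_classifier()
# signed_sentiment("This strategy strengthens regional cooperation.", classifier)
```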

Data

The national-level AI strategies or policies for GPAI and each country under consideration are included as .pdf files in the data/policies directory. Text-only versions of those policies are included as .txt files in the data/texts directory.
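
For working with the text-only versions directly, the following sketch reads every .txt file in a directory into a dictionary keyed by file name. The helper load_texts is hypothetical, and UTF-8 encoding is an assumption about how the files were saved.

```python
from pathlib import Path


def load_texts(text_dir) -> dict[str, str]:
    """Read every .txt file in `text_dir` into a {file stem: text} mapping."""
    return {
        path.stem: path.read_text(encoding="utf-8")  # assumes UTF-8 files
        for path in sorted(Path(text_dir).glob("*.txt"))
    }


# Example usage, run from the repository root:
# texts = load_texts(Path("data") / "texts")
```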

The membership assessment metrics for the Global Partnership on Artificial Intelligence (GPAI) are included in the data/metrics directory. This directory includes the source documents and consolidated metrics for the countries under consideration. The metrics are defined in the 2021 GPAI Frame for letter of intent and reference metrics to support the assessment of GPAI Membership (also available in the same directory). The datasets are organized with the following identifiers:

Identifier  Dataset
aidv        AI and Democratic Values Index
aigs        AI Global Surveillance Index
aii         Stanford AI Index
cri         Commitment to Reducing Inequality Index
di          Democracy Index
gai         Global AI Index
gair        Government AI Readiness Index
gfs         Global Freedom Score
libdem      V-Dem Liberal Democracy Index
odi         Open Data Index
ttaip       Total number of 10% top-cited AI scientific publications, fractional counts

Intermediate data files and output figures and tables are included in the data/output directory.

Results

Exploratory analysis suggests that the approach is feasible. The following figure shows the sentiment and topic classification across Singapore's National AI Strategy.

[Figure: text classification figure]